Switching system

ABSTRACT

A system and method for providing a switch system ( 100 ) having a first configurable set of processor elements ( 102 ) to process storage resource connection requests ( 104 ), a second configurable set of processor elements capable of communications with the first configurable set of processor elements ( 102 ) to receive, from the first configurable set of processor elements, storage connection requests representative of client requests, and to route the requests to at least one of the storage elements ( 104 ), and a configurable switching fabric ( 106 ) interconnected between the first and second sets of processor elements ( 102 ), for receiving at least a first storage connection request ( 104 ) from one of the first set of processor elements ( 102 ), determining an appropriate one of the second set of processors for processing the storage connection request ( 104 ), automatically configuring the storage connection request in accordance with a protocol utilized by the selected one of the second set of processors, and forwarding the storage connection request to the selected one of the second set of processors for routing to at least one of the storage elements.

INCORPORATION BY REFERENCE/PRIORITY CLAIM

[0001] Commonly owned U.S. provisional application for patent Ser. No. 60/245,295 filed Nov. 2, 2000, incorporated by reference herein; and

[0002] Commonly owned U.S. provisional application for patent Ser. No. 60/301,378 filed Jun. 27, 2001, incorporated by reference herein.

[0003] Additional publications are incorporated by reference herein as set forth below.

FIELD OF THE INVENTION

[0004] The present invention relates to digital information processing, and particularly to methods, systems and protocols for managing storage in digital networks.

BACKGROUND OF THE INVENTION

[0005] The rapid growth of the Internet and other networked systems has accelerated the need for processing, transferring and managing data in and across networks.

[0006] In order to meet these demands, enterprise storage architectures have been developed, which typically provide access to a physical storage pool through multiple independent SCSI channels interconnected with storage via multiple front-end and back-end processors/controllers. Moreover, in data networks based on IP/Ethernet technology, standards have been developed to facilitate network management. These standards include Ethernet, Internet Protocol (IP), Internet Control Message Protocol (ICMP), Management Information Block (MIB) and Simple Network Management Protocol (SNMP). Network Management Systems (NMSs) such as HP OpenView utilize these standards to discover and monitor network devices. Examples of networked architectures are disclosed in the following patent documents, the disclosures of which are incorporated herein by reference:

U.S. Pat. No. 5,941,972 Crossroads Systems, Inc.
U.S. Pat. No. 6,000,020 Gadzoox Network, Inc.
U.S. Pat. No. 6,041,381 Crossroads Systems, Inc.
U.S. Pat. No. 6,061,358 McData Corporation
U.S. Pat. No. 6,067,545 Hewlett-Packard Company
U.S. Pat. No. 6,118,776 Vixel Corporation
U.S. Pat. No. 6,128,656 Cisco Technology, Inc.
U.S. Pat. No. 6,138,161 Crossroads Systems, Inc.
U.S. Pat. No. 6,148,421 Crossroads Systems, Inc.
U.S. Pat. No. 6,151,331 Crossroads Systems, Inc.
U.S. Pat. No. 6,199,112 Crossroads Systems, Inc.
U.S. Pat. No. 6,205,141 Crossroads Systems, Inc.
U.S. Pat. No. 6,247,060 Alacritech, Inc.
WO 01/59966 Nishan Systems, Inc.

[0007] Conventional systems, however, do not enable seamless connection and interoperability among disparate storage platforms and protocols. Storage Area Networks (SANs) typically use a completely different set of technology based on Fibre Channel (FC) to build and manage storage networks. This has led to a “re-inventing of the wheel” in many cases. Users are often required to deal with multiple suppliers of routers, switches, host bus adapters and other components, some of which are not well-adapted to communicate with one another. Vendors and standards bodies continue to determine the protocols to be used to interface devices in SANs and NAS configurations; and SAN devices do not integrate well with existing IP-based management systems.

[0008] Still further, the storage devices (Disks, RAID Arrays, and the like), which are Fibre Channel attached to the SAN devices, typically do not support IP (and the SAN devices have limited IP support), and the storage devices cannot be discovered/managed by IP-based management systems. There are essentially two sets of management products—one for the IP devices and one for the storage devices.

[0009] Accordingly, it is desirable to enable servers, storage and network-attached storage (NAS) devices, IP and Fibre Channel switches on storage-area networks (SAN), WANs or LANs to interoperate to provide improved storage data transmission across enterprise networks.

[0010] In addition, among the most widely used protocols for communications within and among networks, TCP/IP (Transmission Control Protocol/Internet Protocol) is the suite of communications protocols used to connect hosts on the Internet. TCP provides reliable, virtual circuit, end-to-end connections for transporting data packets between nodes in a network. Implementation examples are set forth in the following patent and other publications, the disclosures of which are incorporated herein by reference:

U.S. Pat. No. 5,260,942 IBM
U.S. Pat. No. 5,442,637 ATT
U.S. Pat. No. 5,566,170 Storage Technology Corporation
U.S. Pat. No. 5,598,410 Storage Technology Corporation
U.S. Pat. No. 6,006,259 Network Alchemy, Inc.
U.S. Pat. No. 6,018,530 Sham Chakravorty
U.S. Pat. No. 6,122,670 TSI Telsys, Inc.
U.S. Pat. No. 6,163,812 IBM
U.S. Pat. No. 6,178,448 IBM

[0011] Although TCP is useful, it requires substantial processing by the system CPU, thus limiting throughput and system performance. Designers have attempted to avoid this limitation through various inter-processor communications techniques, some of which are described in the above-cited publications. For example, some have offloaded TCP processing tasks to an auxiliary CPU, which can reside on an intelligent network interface or similar device, thereby reducing load on the system CPU. However, this approach does not eliminate the problem, but merely moves it elsewhere in the system, where it remains a single chokepoint of performance limitation.

[0012] Others have identified separable components of TCP processing and implemented them in specialized hardware. These can include calculation or verification of TCP checksums over the data being transmitted, and the appending or removing of fixed protocol headers to or from such data. These approaches are relatively simple to implement in hardware to the extent they perform only simple, condition-invariant manipulations, and do not themselves cause a change to be applied to any persistent TCP state variables. However, while these approaches somewhat reduce system CPU load, they have not been observed to provide substantial performance gains.

[0013] Some required components of TCP, such as retransmission of a TCP segment following a timeout, are difficult to implement in hardware, because of their complex and condition-dependent behavior. For this reason, systems designed to perform substantial TCP processing in hardware often include a dedicated CPU capable of handling these exception conditions. Alternatively, such systems may decline to handle TCP segment retransmission or other complex events and instead defer their processing to the system CPU.

[0014] However, a major difficulty in implementing such “fast path/slow path” solutions is ensuring that the internal state of the TCP connections, which can be modified as a result of performing these operations, is consistently maintained, whether the operations are performed by the “fast path” hardware or by the “slow path” system CPU.

[0015] It is therefore desirable to provide methods, devices and systems that simplify and improve these operations.

[0016] It is also desirable to provide methods, devices and systems that simplify management of storage in digital networks, and enable flexible deployment of NAS, SAN and other storage systems, and Fibre Channel (FC), IP/Ethernet and other protocols, with storage subsystem and location independence.

SUMMARY OF THE INVENTION

[0017] The invention addresses the noted problems typical of prior art systems, and in one aspect, provides a switch system having a first configurable set of processor elements to process storage resource connection requests, a second configurable set of processor elements capable of communications with the first configurable set of processor elements to receive, from the first configurable set of processor elements, storage connection requests representative of client requests, and to route the requests to at least one of the storage elements, and a configurable switching fabric interconnected between the first and second sets of processor elements, for receiving at least a first storage connection request from one of the first set of processor elements, determining an appropriate one of the second set of processors for processing the storage connection request, automatically configuring the storage connection request in accordance with a protocol utilized by the selected one of the second set of processors, and forwarding the storage connection request to the selected one of the second set of processors for routing to at least one of the storage elements.

[0018] Another aspect of the invention provides methods, systems and devices for enabling data replication under NFS servers.

[0019] A further aspect of the invention provides mirroring of NFS servers using a multicast function.

[0020] Yet another aspect of the invention provides dynamic content replication under NFS servers.

[0021] In another aspect, the invention provides load balanced NAS using a hashing or similar function, and dynamic data grooming and NFS load balancing across NFS servers.

[0022] The invention also provides, in a further aspect, domain sharing across multiple FC switches, and secure virtual storage domains (SVSD).

[0023] Still another aspect of the invention provides TCP/UDP acceleration, with IP stack bypass using a network processor (NP). The present invention simultaneously maintains TCP state information in both the fast path and the slow path. Control messages are exchanged between the fast path and slow path processing engines to maintain state synchronization, and to hand off control from one processing engine to another. These control messages can be optimized to require minimal processing in the slow path engines (e.g., system CPU) while enabling efficient implementation in the fast path hardware. This distributed synchronization approach significantly accelerates TCP processing, but also provides additional benefits, in that it permits the creation of more robust systems.
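
By way of non-limiting illustration, the following Python sketch models the distributed fast-path/slow-path synchronization just described. It is a simplified behavioral model only: the class and message names (TcpState, ControlMsg, FastPath, SlowPath) and the choice of state variables are hypothetical stand-ins, not the actual control message formats of the invention.

    # Illustrative sketch only: a simplified model of fast-path/slow-path
    # TCP state synchronization via control messages. Names are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class TcpState:
        snd_nxt: int = 0      # next sequence number to send
        rcv_nxt: int = 0      # next sequence number expected
        owner: str = "fast"   # which engine currently owns the connection

    @dataclass
    class ControlMsg:
        kind: str             # "SYNC" or "HANDOFF"
        conn_id: int
        snd_nxt: int
        rcv_nxt: int

    class FastPath:
        def __init__(self, slow_path):
            self.conns = {}
            self.slow = slow_path

        def on_segment(self, conn_id, seg_len, complex_event=False):
            st = self.conns.setdefault(conn_id, TcpState())
            if complex_event:
                # Hand off to the slow path (e.g. retransmission after a
                # timeout), carrying current state so both engines agree.
                st.owner = "slow"
                self.slow.on_control(
                    ControlMsg("HANDOFF", conn_id, st.snd_nxt, st.rcv_nxt))
                return
            st.rcv_nxt += seg_len   # condition-invariant fast-path update
            self.slow.on_control(
                ControlMsg("SYNC", conn_id, st.snd_nxt, st.rcv_nxt))

    class SlowPath:
        def __init__(self):
            self.conns = {}

        def on_control(self, msg):
            st = self.conns.setdefault(msg.conn_id, TcpState())
            st.snd_nxt, st.rcv_nxt = msg.snd_nxt, msg.rcv_nxt
            if msg.kind == "HANDOFF":
                st.owner = "slow"   # system CPU now drives this connection

    slow = SlowPath()
    fast = FastPath(slow)
    fast.on_segment(conn_id=1, seg_len=1460)                    # fast-path traffic
    fast.on_segment(conn_id=1, seg_len=0, complex_event=True)   # timeout handoff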

[0024] The invention, in another aspect, also enables automatic discovery of SCSI devices over an IP network, and mapping of SNMP requests to SCSI.

[0025] In addition, the invention also provides WAN mediation caching on local devices.

[0026] Each of these aspects will next be described in detail, with reference to the attached drawing figures.

BRIEF DESCRIPTION OF THE DRAWINGS

[0027] FIG. 1 depicts a hardware architecture of one embodiment of the switch system aspect of the invention.

[0028] FIG. 2 depicts interconnect architecture useful in the embodiment of FIG. 1.

[0029] FIG. 3 depicts processing and switching modules.

[0030] FIG. 4 depicts software architecture in accordance with one embodiment of the invention.

[0031] FIG. 5 depicts detail of the client abstraction layer.

[0032] FIG. 6 depicts the storage abstraction layer.

[0033] FIG. 7 depicts scaleable NAS.

[0034] FIG. 8 depicts replicated local/remote storage.

[0035] FIG. 9 depicts a software structure useful in one embodiment of the invention.

[0036] FIG. 9a depicts the MIC and MLAN components of FIG. 9.

[0037] FIG. 9b depicts the MLAN, LIC and SRC-NAS and fabric components of FIG. 9.

[0038] FIG. 9c depicts the SRC-Mediator component and fabric of FIG. 9.

[0039] FIG. 10 depicts system services.

[0040] FIG. 11 depicts a management software overview.

[0041] FIG. 12 depicts a virtual storage domain.

[0042] FIG. 13 depicts another virtual storage domain.

[0043] FIG. 14 depicts a configuration processing boot-up sequence.

[0044] FIG. 15 depicts a further virtual storage domain example.

[0045] FIG. 16 is a flow chart of NFS mirroring and related functions.

[0046] FIG. 17 depicts interface module software.

[0047] FIG. 18 depicts a flow control example.

[0048] FIG. 19 depicts hardware in an SRC.

[0049] FIG. 20 depicts SRC NAS software modules.

[0050] FIG. 21 depicts SCSI/UDP operation.

[0051] FIG. 22 depicts SRC software storage components.

[0052] FIG. 23 depicts FC originator/FC target operation.

[0053] FIG. 24 depicts load balancing NFS client requests between NFS servers.

[0054] FIG. 25 depicts NFS receive micro-code flow.

[0055] FIG. 26 depicts NFS transmit micro-code flow.

[0056] FIG. 27 depicts file handle entry into multiple server lists.

[0057] FIG. 28 depicts a sample network configuration in another embodiment of the invention.

[0058] FIG. 29 depicts an example of a virtual domain configuration.

[0059] FIG. 30 depicts an example of a VLAN configuration.

[0060] FIG. 31 depicts a mega-proxy example.

[0061] FIG. 32 depicts device discovery in accordance with another aspect of the invention.

[0062] FIG. 33 depicts SNMP/SCSI mapping.

[0063] FIG. 34 depicts SCSI response/SNMP trap mapping.

[0064] FIG. 35 depicts data structures useful in another aspect of the invention.

[0065] FIG. 36 depicts mirroring and load balancing operation.

[0066] FIG. 37 depicts server classes.

[0067] FIGS. 38A, 38B, 38C depict mediation configurations in accordance with another aspect of the invention.

[0068] FIG. 39 depicts operation of mediation protocol engines.

[0069] FIG. 40 depicts configuration of storage by the volume manager in accordance with another aspect of the invention.

[0070] FIG. 41 depicts data structures for keeping track of virtual devices and sessions.

[0071] FIG. 42 depicts mediation manager operation in accordance with another aspect of the invention.

[0072] FIG. 43 depicts mediation in accordance with one practice of the invention.

[0073] FIG. 44 depicts mediation in accordance with another practice of the invention.

[0074] FIG. 45 depicts fast-path architecture in accordance with the invention.

[0075] FIG. 46 depicts IXP packet receive processing for mediation.

DETAILED DESCRIPTION OF THE INVENTION

I. Overview

[0076] FIG. 1 depicts the hardware architecture of one embodiment of a switch system according to the invention. As shown therein, the switch system 100 is operable to interconnect clients and storage. As discussed in detail below, storage processor elements 104 (SPs) connect to storage; IP processor elements 102 (IP) connect to clients or other devices; and a high speed switch fabric 106 interconnects the IP and SP elements, under the control of control elements 103.

[0077] The IP processors provide content-aware switching, load balancing, mediation, TCP/UDP hardware acceleration, and fast forwarding, all as discussed in greater detail below. In one embodiment, the high speed fabric comprises redundant control processors and a redundant switching fabric, provides scalable port density and is media-independent. As described below, the switch fabric enables media-independent module interconnection, and supports low-latency Fibre Channel (F/C) switching. In an embodiment of the invention commercially available from the assignee of this application, the fabric maintains QoS for Ethernet traffic, is scalable from 16 to 256 Gbps, and can be provisioned as fully redundant switching fabric with fully redundant control processors, ready for 10 Gb Ethernet, InfiniBand and the like. The SPs support NAS (NFS/CIFS), mediation, volume management, Fibre Channel (F/C) switching, SCSI and RAID services.

[0078] FIG. 2 depicts an interconnect architecture adapted for use in the switching system 100 of FIG. 1. As shown therein, the architecture includes multiple processors interconnected by dual paths 110, 120. Path 110 is a management and control path adapted for operation in accordance with switched Ethernet. Path 120 is a high speed switching fabric, supporting a point to point serial interconnect. Also as shown in FIG. 2, front-end processors include SFCs 130, LAN Resource Cards (LRCs) 132, and Storage Resource Cards (SRCs) 134, which collectively provide processing power for the functions described below. Rear-end processors include MICs 136, LIOs 138 and SIOs 140, which collectively provide wiring and control for the functions described below.

[0079] In particular, the LRCs provide interfaces to external LANs, servers, WANs and the like (such as by 4×Gigabit Ethernet or 32×10/100Base-T Ethernet interface adapters); perform load balancing, content-aware switching of internal services; implement storage mediation protocols; and provide TCP hardware acceleration.

[0080] The SRCs interface to external storage or other devices (such as via Fibre Channel, 1 or 2 Gbps, FC-AL or FC-N).

[0081] As shown in FIG. 3, LRCs and LIOs are network processors providing LAN-related functions. They can include GBICs and RJ45 processors. MICs provide control and management. As discussed below, the switching system utilizes redundant MICs and redundant fabrics. The FIOs shown in FIG. 3 provide F/C switching. These modules can be commercially available ASIC-based F/C switch elements, and collectively enable low cost, high-speed SAN using the methods described below.

[0082] FIG. 4 depicts a software architecture adapted for use in an embodiment of switching system 100, wherein a management layer 402 interconnects with client services 404, mediation services 406, storage services 408, a client abstraction layer 410, and a storage abstraction layer 412. In turn, the client abstraction layer interconnects with client interfaces (LAN, SAN or other) 414, and the storage abstraction layer interconnects with storage devices or storage interfaces (LAN, SAN or other) 416.

[0083] The client abstraction layer isolates, secures, and protects internal resources; enforces external group isolation and user authentication; provides firewall access security; supports redundant network access with fault failover, and integrates IP routing and multiport LAN switching. In addition, it presents external clients with a “Virtual service” abstraction of internal services, so that there is no need to reconfigure clients when services are changed. Further, it provides internal services a consistent network interface, wherein service configuration is independent of network connectivity, and there is no impact from VLAN topology, multihoming or peering.

[0084] FIG. 5 provides detail of the client abstraction layer. As shown therein, it can include TCP acceleration function 502 (which, among other activities, offloads processing of reliable data streams); load balancing function 504 (which distributes requests among equivalent resources); content-aware switching 506 (which directs requests to an appropriate resource based on the contents of the requests/packets); virtualization function 508 (which provides isolation and increased security); 802.1 switching and IP routing function 510 (which supports link/path redundancy), and physical I/F support functions 512 (which can support 10/100Base-T, Gigabit Ethernet, Fibre Channel and the like).

[0085] In addition, an internal services layer provides protocol mediation, supports NAS and switching and routing. In particular, in iSCSI applications the internal services layer uses TCP/IP or the like to provide LAN-attached servers with access to block-oriented storage; in FC/IP it interconnects Fibre Channel SAN “islands” across an Internet backbone; and in IP/FC applications it extends IP connectivity across Fibre Channel. Among NAS functions, the internal services layer includes support for NFS (industry-standard Network File Service, provided over UDP/IP (LAN) or TCP/IP (WAN)); and CIFS (compatible with Microsoft Windows File Services, also known as SMB). Among switching and routing functions, the internal services layer supports Ethernet, Fibre Channel and the like.

[0086] The storage abstraction layer shown in FIG. 6 includes file system 602, volume management 604, RAID function 606, storage access processing 608, transport processing 610 and physical I/F support 612. File system layer 602 supports multiple file systems; the volume management layer creates and manages logical storage partitions; the RAID layer enables optional data replication; the storage access processing layer supports SCSI or similar protocols, and the transport layer is adapted for Fibre Channel or SCSI support. The storage abstraction layer consolidates external disk drives, storage arrays and the like into a sharable, pooled resource; and provides volume management that allows dynamically resizeable storage partitions to be created within the pool; RAID service that enables volume replication for data redundancy and improved performance; and file service that allows creation of distributed, sharable file systems on any storage partition.

[0087] A technical advantage of this configuration is that a single storage system can be used for both file and block storage access (NAS and SAN).

[0088] FIGS. 7 and 8 depict examples of data flows through the switching system 100. (It will be noted that these configurations are provided solely by way of example, and that other configurations are possible.) In particular, as will be discussed in greater detail below, FIG. 7 depicts a scaleable NAS example, while FIG. 8 depicts a replicated local/remote storage example. As shown in FIG. 7, the switch system 100 includes secure virtual storage domain (SVSD) management layer 702, NFS servers collectively referred to by numeral 704, and modules 706 and 708.

[0089] Gigabit module 706 contains TCP 710, load balancing 712, content-aware switching 714, virtualization 716, 802.1 switching and IP routing 718, and Gigabit (GB) optics collectively referred to by numeral 720.

[0090] FC module 708 contains file system 722, volume management 724, RAID 726, SCSI 728, Fibre Channel 730, and FC optics collectively referred to by numeral 731.

[0091] As shown in the scaleable NAS example of FIG. 7, the switch system 100 connects clients on multiple Gigabit Ethernet LANs 732 (or similar) to (1) unique content on separate storage 734 and (2) replicated filesystems for commonly accessed files 736. The data pathways depicted run from the clients, through the GB optics, 802.1 switching and IP routing, virtualization, content-aware switching, load balancing and TCP, into the NFS servers (under the control/configuration of SVSD management), and into the file system, volume management, RAID, SCSI, Fibre Channel, and FC optics to the unique content (which bypasses RAID), and replicated filesystems (which flow through RAID).

[0092] Similar structures are shown in the replicated local/remote storage example of FIG. 8. However, in this case, the interconnection is between clients on Gigabit Ethernet LAN (or similar) 832, secondary storage at an offsite location via a TCP/IP network 834, and locally attached primary storage 836. In this instance, the flow is from the clients, through the GB optics, 802.1 switching and IP routing, virtualization, content-aware switching, load balancing and TCP, then through iSCSI mediation services 804 (under the control/configuration of SVSD management 802), then through volume management 824, and RAID 826. Then, one flow is from RAID 826 through SCSI 828, Fibre Channel 830 and FC Optics 831 to the locally attached storage 836; while another flow is from RAID 826 back to TCP 810, load balancing 812, content-aware switching 814, virtualization 816, 802.1 switching and IP routing 818 and GB optics 820 to secondary storage at an offsite location via a TCP/IP network 834.

II. Hardware/Software Architecture

[0093] This section provides an overview of the structure and function of the invention (alternatively referred to hereinafter as the “Pirus box”). In one embodiment, the Pirus box is a 6 slot, carrier class, high performance, multi-layer switch, architected to be the core of the data storage infrastructure. The Pirus box will be useful for ASPs (Application Storage Providers), SSPs (Storage Service Providers) and large enterprise networks. One embodiment of the Pirus box will support Network Attached Storage (NAS) in the form of NFS attached disks off of Fibre Channel ports. These attached disks are accessible via 10/100/1000 switched Ethernet ports. The Pirus box will also support standard layer 2 and layer 3 switching with port-based VLAN support, and layer 3 routing (on unlearned addresses). RIP will be one routing protocol supported, with OSPF and others also to be supported. The Pirus box will also initiate and terminate a wide range of SCSI mediation protocols, allowing access to the storage media either via Ethernet or SCSI/FC. The box is manageable via a CLI, SNMP or an HTTP interface.

[0094] 1. Software Architecture Overview

[0095] FIG. 9 is a block diagram illustrating the software modules used in the Pirus box (the terms of which are defined in the glossary set forth below). As shown in FIG. 9, the software structures correspond to MIC 902, LIC 904, SRC-NAS 908 and SRC-Mediator 910, interconnected by MLAN 905 and fabric 906. The operation of each of the components shown in the drawing is discussed below.

[0096] 1.1 System Services

[0097] The term System Service is used herein to denote a significant function that is provided on every processor in every slot. It is contemplated that many such services will be provided, and that they can be segmented into 2 categories: 1) abstracted hardware services and 2) client/server services. The attached FIG. 10 is a diagram of some of the exemplary interfaces. As shown in FIG. 10, the system services correspond to IPCs 1002 and 1004 associated with fabric and control channel 1006, and with services SCSI 1008, RSS 1010, NPCS 1012, AM 1014, Log/Event 1016, Cache/Bypass 1018, TCP/IP 1020, and SM 1022.

[0098] 1.1.1 SanStreaM (SSM) System Services (S2)

[0099] SSM system service can be defined as a service that provides a software API layer to application software while “hiding” the underlying hardware control. These services may add value to the process by adding protocol layering or robustness to the standard hardware functionality.

[0100] System services that are provided include:

[0101] Card Processor Control Manager (CPCM). This service provides a mechanism to detect and manage the issues involved in controlling a Network Engine Card (NEC) and its associated Network Processors (NP). They include insertion and removal, temperature control, crash management, loader, watchdog, failures, etc.

[0102] Local Hardware Control (LHC). This controls the hardware local to the board itself. It includes LEDs, fans, and power.

[0103] Inter-Processor Communication (IPC). This includes control bus and fabric services, and remote UART.

[0104] 1.1.2 SSM Application Service (AS)

[0105] Application services provide an API on top of SSM system services. They are useful for executing functionality remotely.

[0106] Application Services include:

[0107] Remote Shell Service (RSS)—includes redirection of debug and other valuable info to any pipe in the system.

[0108] Statistics Provider—providers register with the stats consumer to provide the needed information, such as MIB read-only attributes.

[0109] Network Processor Config Service (NPCS)—used to receive and process configuration requests.

[0110] Action Manager—used to send and receive requests to execute remote functionality such as rebooting, clearing stats and re-syncing with a file system.

[0111] Logging Service—used to send and receive event logging information.

[0112] Buffer Management—used as a fast and useful mechanism for allocating, typing, chaining and freeing message buffers in the system.

[0113] HTTP Caching/Bypass service—sub-system to supply an API and functional service for HTTP file caching and bypass. It will make the determination to cache a file, retrieve a cached file (on board or off), and bypass a file (on board or not). In addition this service will keep track of local cached files and their associated TTL, as well as statistics on file bypassing. It will also keep a database of known files and their caching and bypassing status. (A sketch of this determination appears following this list.)

[0114] Multicast services—A service to register, send and receive multicast packets across the MLAN.
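
By way of illustration, the following Python sketch shows one plausible shape for the caching/bypass determination described in the HTTP Caching/Bypass service item above. The class, thresholds and field names are hypothetical; the sketch shows only the determination to cache, retrieve or bypass a file, with TTL tracking and bypass statistics.

    # Hypothetical sketch of the HTTP caching/bypass determination.
    import time

    class HttpCacheBypass:
        def __init__(self, ttl=60.0, max_cacheable=1 << 20):
            self.ttl = ttl
            self.max_cacheable = max_cacheable
            self.known = {}          # url -> {"data", "expires", "status"}
            self.bypass_count = 0    # statistics on file bypassing

        def fetch(self, url, origin_get, size):
            entry = self.known.get(url)
            if (entry and entry["status"] == "cached"
                    and time.monotonic() < entry["expires"]):
                return entry["data"]               # retrieve a cached file
            if size > self.max_cacheable:
                self.bypass_count += 1
                self.known[url] = {"status": "bypass"}
                return origin_get(url)             # bypass: do not cache
            data = origin_get(url)                 # determination: cache it
            self.known[url] = {"data": data, "status": "cached",
                               "expires": time.monotonic() + self.ttl}
            return data

    svc = HttpCacheBypass()
    print(svc.fetch("/a.html", lambda u: "<html>", size=512))     # cached
    print(svc.fetch("/big.iso", lambda u: "blob", size=1 << 30))  # bypassed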

[0115] 2. Management Interface Card

[0116] The Management Interface Card (MIC) of the Pirus box has a single high performance microprocessor and multiple 10/100 Ethernet interfaces for administration of the SANStream management subsystem. This card also has a PCMCIA device for bootstrap image and configuration storage.

[0117] In the illustrated embodiments, the Management Interface Card will not participate in any routing protocol or forwarding path decisions. The IP stack and services of VxWorks will be used as the underlying IP facilities for all processes on the MIC. The MIC card will also have a flash based, DOS file system.

[0118] The MIC will not be connected to the backplane fabric but will be connected to the MLAN (Management LAN) in order to send/receive data to/from the other cards in the system. The MLAN is used for all MIC “other cards” communications.

[0119] 2.1. Management Software

[0120] Management software is a collection of components responsible for configuration, reporting (status, statistics, etc.), notification (events) and billing data (accounting information). The management software may also include components that implement services needed by the other modules in the system.

[0121] Some of the management software components can exist on any processor in the system, such as the logging server. Other components reside only on the MIC, such as the WEB Server providing the WEB user interface.

[0122] The strategy and subsequent architecture must be flexible enough to provide a long-term solution for the product family. In other words, the 1.0 implementation must not preclude the inclusion of additional management features in subsequent releases of the product.

[0123] The management software components that can run on either the MIC or NEC need to meet the requirement of being able to “run anywhere” in the system.

[0124] 2.2 Management Software Overview

[0125] In the illustrated embodiments the management software decomposes into the following high-level functions, shown in FIG. 11. As shown in the example of FIG. 11 (other configurations are also possible and within the scope of the invention), management software can be organized into User Interfaces (UIs) 1102, rapid control backplane (RCB) data dictionary 1104, system abstraction model (SAM) 1106, configuration & statistics manager (CSM) 1108, and logging/billing APIs 1110, on module 1101. This module can communicate across system services (S2) 1112 and hardware elements 1114 with configuration & statistics agent (CSA) 1116 and applications 1118.

[0126] The major components of the management software include the following:

[0127] 2.2.1 User Interfaces (UIs)

[0128] These components are the user interfaces that allow the user access to the system via a CLI, HTTP Client or SNMP Agent.

[0129] 2.2.2 Rapid Control Backplane (RCB)

[0130] These components make up the database or data dictionary of settable/gettable objects in the system. The UIs use “Rapid Marks” (keys) to reference the data contained within the database. The actual location of the data specified by a Rapid Mark may be on or off the MIC.

[0131] 2.2.3 System Abstraction Model (SAM)

[0132] These components provide a software abstraction of the physical components in the system. The SAM works in conjunction with the RCB to get/set data for the UIs. The SAM determines where the data resides and if necessary interacts with the CSM to get/set the data.
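
The following Python sketch is a hypothetical illustration of the RCB/SAM/CSM interaction described above: a UI get keyed by a Rapid Mark is resolved by the SAM, which either satisfies it from data local to the MIC or delegates to the CSM for data residing on another card. All names and the location table are illustrative assumptions.

    # Hypothetical sketch of Rapid Mark resolution through the SAM.
    class CSM:
        """Stands in for the Configuration & Statistics Manager, which
        would message the owning card's CSA over the MLAN."""
        def get(self, card, key):
            return f"<value of {key} fetched from card {card}>"

    class SAM:
        def __init__(self, csm):
            self.csm = csm
            # Where each Rapid Mark's datum resides (illustrative only).
            self.location = {"sys.name": "MIC", "port.1.rxBytes": "LIC-slot2"}

        def resolve(self, rapid_mark, local_store):
            where = self.location[rapid_mark]
            if where == "MIC":
                return local_store[rapid_mark]      # data is on the MIC
            return self.csm.get(where, rapid_mark)  # data is on another card

    class RCB:
        def __init__(self, sam):
            self.sam = sam
            self.local = {"sys.name": "pirus-1"}

        def get(self, rapid_mark):
            return self.sam.resolve(rapid_mark, self.local)

    rcb = RCB(SAM(CSM()))
    print(rcb.get("sys.name"))          # satisfied locally on the MIC
    print(rcb.get("port.1.rxBytes"))    # delegated to the owning card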

[0133] 2.2.4 Configuration & Statistics Manager (CSM)

[0134] These components are responsible for communicating with the other cards in the system to get/set data. For example, the CSM sends configuration data to a card/processor when a UI initiates a change and receives statistics from a card/processor when a UI requests some data.

[0135] 2.2.5 Logging/Billing APIs

[0136] These components interface with the logging and event servers provided by System Services and are responsible for sending logging/billing data to the desired location and generating SNMP traps/alerts when needed.

[0137] 2.2.6 Configuration & Statistics Agent (CSA)

[0138] These components interface with the CSM on the MIC and respond to CSM messages for configuration/statistics data.

[0139] 2.3 Dynamic Configuration

[0140] The SANStream management system will support dynamic configuration updates. A significant advantage is that it will be unnecessary to reboot the entire chassis when an NP's configuration is modified. The bootstrap configuration can follow similar dynamic guidelines. Bootstrap configuration is merely dynamic configuration of an NP that is in the reset state.

[0141] Both soft and hard configuration will be supported. Soft configuration allows dynamic modification of current system settings.

[0142] Hard configuration modifies bootstrap or start-up parameters. A hard configuration is accomplished by saving a soft configuration. A hard configuration change can also be made by (T)FTP of a configuration file. The MIC will not support local editing of configuration files.

[0143] In a preferred practice of the invention DNS services will be available and utilized by MIC management processes to resolve hostnames into IP addresses.

[0144] 2.4 Management Applications

[0145] In addition to providing “rote” management of the system, the management software will be providing additional management applications/functions. The level of integration with the WEB UI for these applications can be left to the implementer. For example, the Zoning Manager could either be folded into the HTML pages served by the embedded HTTP server OR the HTTP server could serve up a stand-alone JAVA Applet.

[0146] 2.4.1 Volume Manager

[0147] A preferred practice of the invention will provide a volume manager function. Such a Volume Manager may support:

[0148] Raid 0—Striping

[0149] Raid 1—Mirroring

[0150] Hot Spares

[0151] Aggregating several disks into a large volume.

[0152] Partitioning a large disk into several smaller volumes.

[0153] 2.4.2 Load Balancer

[0154] This application configures the load balancing functionality. This involves configuring policies to guide traffic through the system to its ultimate destination. This application will also report status and usage statistics for the configured policies.

[0155] 2.4.3 Server-less Backup (NDMP)

[0156] This application will support NDMP and allow for serverless backup. This will allow users the ability to back up disk devices to tape devices without a server intervening.

[0157] 2.4.4 IP-ized Storage Management

[0158] This application will “hide” storage and FC parameters from IP-centric administrators. For example, storage devices attached to FC ports will appear as IP devices in an HP-OpenView network map. These devices will be “ping-able”, “discoverable” and support a limited scope of MIB variables.

[0159] In order to accomplish this, IP addresses must be assigned to the storage devices (either manually or automatically) and the MIC will have to be sent all IP Mgmt (exact list TBD) packets destined for one of the storage IP addresses. The MIC will then mediate by converting the IP packet (request) to a similar FC/SCSI request and sending it to the device.

[0160] For example, an IP Ping would become a SCSI Inquiry, while an SNMP get of sysDescription would also be a SCSI Inquiry with some of the returned data (from the Inquiry) mapped into the MIB variable and returned to the requestor. These features are discussed in greater detail in the IP Storage Management section below.
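
A minimal Python sketch of this mediation, assuming a hypothetical scsi_inquiry() helper standing in for an FC/SCSI transaction, might look as follows; the OID shown is the standard sysDescr OID, and the mapping of INQUIRY fields into the MIB variable is illustrative only.

    # Illustrative sketch: mapping IP management requests onto SCSI Inquiry.
    def scsi_inquiry(device):
        # Placeholder for issuing a SCSI INQUIRY over FC and returning data.
        return {"vendor": "ACME", "product": "Disk9000", "revision": "1.2"}

    def handle_ip_request(kind, device, oid=None):
        if kind == "icmp-echo":
            # An IP Ping becomes a SCSI Inquiry; any response means "alive".
            return scsi_inquiry(device) is not None
        if kind == "snmp-get" and oid == "1.3.6.1.2.1.1.1.0":   # sysDescr
            inq = scsi_inquiry(device)
            # Map returned INQUIRY fields into the MIB variable.
            return f"{inq['vendor']} {inq['product']} rev {inq['revision']}"
        raise NotImplementedError("only a limited scope of MIB variables")

    print(handle_ip_request("icmp-echo", device="disk-A"))
    print(handle_ip_request("snmp-get", device="disk-A",
                            oid="1.3.6.1.2.1.1.1.0"))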

[0161] 2.4.5 Mediation Manager

[0162] This application is responsible for configuring, monitoring and managing the mediation between storage and networking protocols. This includes session configurations, terminations, usage reports, etc. These features are discussed in greater detail in the Mediation Manager section below.

[0163] 2.4.6 VLAN Manager

[0164] Port level VLANs will be supported. Ports can belong to more than one VLAN.

[0165] The VLAN Manager and Zoning Manager could be combined into a VDM (or some other name) Manager as a way of unifying the Ethernet and FC worlds.

[0166] 2.4.7 File System Manager

[0167] The majority of file system management will probably be to “accept the defaults”. There may be an exception if it is necessary to format disks when they are attached to a Pirus system or perform other disk operations.

[0168] 2.5 Virtual Storage Domain (VSD)

[0169] Virtual storage domains serve 2 purposes.

[0170] 1. Logically group together a collection of resources.

[0171] 2. Logically group together and “hide” a collection of resources from the outside world.

[0172] The 2 cases are very similar. The second case is used when we are load balancing among NAS servers.

[0173] FIG. 12 illustrates the first example:

[0174] In this example Server 1 1226 is using SCSI/IP to communicate with Disks A and B at a remote site while Server 2 1224 is using SCSI/IP to communicate with Disks C and D 1208 at the same remote site. For this configuration Disks A, B, C, and D must have valid IP addresses. Logically inside the PIRUS system 2 Virtual Domains are created, one for Disks A and B and one for Disks C and D. The IFF software doesn't need to know about the VSDs; since the IP addresses for the disks are valid (exportable), it can simply forward the traffic to the correct destination. The VSD is configured for the management of the resources (disks).

[0175] The second usage of virtual domains is more interesting. In this case let's assume we want to load balance among 3 NAS servers. A VSD would be created and a Virtual IP Address (VIP) assigned to it. External entities would use this VIP to address the NAS and internally the PIRUS system would use NAT and policies to route the request to the correct NAS server. FIG. 13 illustrates this.

[0176] In this example users of the NAS service would simply reference the VIP for Joe's ASP NAS LB service. Internally, through the combination of virtual storage domains and policies, the Pirus system load balances the request among 3 internal NAS servers 1306, 1308, 1310, thus providing a scalable, redundant NAS solution.

[0177] Virtual Domains can be used to virtualize the entire Pirus system.

[0178] Within VSDs the following entities are noteworthy:

[0179] 2.5.1 Services

[0180] Services represent the physical resources. Examples of services are:

[0181] 1. Storage Devices attached to FC or Ethernet ports. These devices can be simple disks, complex RAID arrays, FC-AL connections, tape devices, etc.

[0182] 2. Router connections to the Internet.

[0183] 3. NAS—Internally defined ones only.

[0184] 2.5.2 Policies

[0185] A preferred practice of the invention can implement the following types of policies:

[0186] 1. Configuration Policy—A policy to configure another policy or a feature. For example a NAS Server in a virtual domain will be configured as a “Service”. Another way to look at it is that a Configuration Policy is simply the collection of configurable parameters for an object.

[0187] 2. Usage Policy—A policy to define how data is handled. In our case load balancing is an example of a “Usage Policy”. When a user configures load balancing they are defining a policy that specifies how to distribute client requests based on a set of criteria.

[0188] There are many ways to describe a policy or policies. For our purposes we will define a policy as composed of the following:

[0189] 1. Policy Rules—1 or more rules describing “what to do”. A rule is made up of condition(s) and actions. Conditions can be as simple as “match anything” or as complex as “if source IP address 1.1.1.1 and it's 2:05”. Likewise, actions can be as simple as “send to 2.2.2.2” or as complex as “load balance using LRU between 3 NAS servers.”

[0190] 2. Policy Domain—A collection of object(s) Policy Rules apply to. For example, suppose there was a policy that said “load balance using round robin”. The collection of NAS servers being load balanced is the policy domain for the policy.

[0191] Policies can be nested to form complex policies.
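
By way of non-limiting example, the following Python sketch renders the policy model just described as data structures: a policy is one or more rules (conditions plus actions) applied over a policy domain, and domains may contain nested policies. All field names and the round-robin usage policy are illustrative assumptions.

    # Illustrative data structures for policy rules and policy domains.
    from dataclasses import dataclass
    from typing import Callable, List, Union

    @dataclass
    class Rule:
        condition: Callable[[dict], bool]  # e.g. "match anything", src-IP test
        action: Callable[[dict], dict]     # e.g. "send to 2.2.2.2", balance

    @dataclass
    class Policy:
        rules: List[Rule]
        domain: List[Union[str, "Policy"]]  # objects the rules apply to;
                                            # nesting forms complex policies

    def match_anything(pkt):
        return True

    # A usage policy that round-robins across a domain of three NAS servers.
    nas_domain = ["nas1", "nas2", "nas3"]
    counter = {"i": 0}
    def round_robin(pkt):
        target = nas_domain[counter["i"] % len(nas_domain)]
        counter["i"] += 1
        return {**pkt, "dst": target}

    load_balance = Policy(rules=[Rule(match_anything, round_robin)],
                          domain=nas_domain)

    pkt = {"src": "1.1.1.1", "dst": "100.100.100.100"}
    for rule in load_balance.rules:
        if rule.condition(pkt):
            pkt = rule.action(pkt)
    print(pkt["dst"])   # one of nas1/nas2/nas3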

[0192] 2.6 Boot Sequence and Configuration

[0193] The MIC and other cards coordinate their actions during boot up configuration processing via System Service's Notify Service. These actions need to be coordinated in order to prevent the passing of traffic before configuration file processing has completed.

[0194] The other cards need to initialize with default values and set the state of their ports to “hold down” and wait for a “Config Complete” event from the MIC. Once this event is received the ports can be released and process traffic according to the current configuration. (Which may be default values if there were no configuration commands for the ports in the configuration file.)

[0195] FIG. 14 illustrates this part of the boot up sequence and interactions between the MIC, S2 Notify and other cards.

[0196] There is an error condition in this sequence where the card never receives the “Config Complete” event. Assuming the software is working properly, then this condition is caused by a hardware problem and the ports on the cards will be held in the “hold down” state. If CSM/CSA is working properly then the MIC Mgmt Software will show the ports down, or CPCM might detect that the card is not responding and notify the MIC. In any case there are several ways to learn about and notify users about the failure.
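
The following Python sketch is a simplified model of this coordination, with a timeout standing in for the hardware-failure case; the event and state names are taken from the text, but the signaling mechanism shown (a thread event) is purely illustrative.

    # Simplified model of the "hold down" / "Config Complete" boot sequence.
    import threading

    class Card:
        def __init__(self, name):
            self.name = name
            self.ports = "hold down"           # default state at initialization
            self.config_complete = threading.Event()

        def boot(self, timeout=5.0):
            if self.config_complete.wait(timeout):
                self.ports = "released"        # process traffic per current config
            else:
                # Never received "Config Complete": leave ports held down and
                # let CSM/CSA or CPCM surface the failure to the MIC.
                self.ports = "hold down (failed)"

    card = Card("LIC-slot2")
    t = threading.Thread(target=card.boot)
    t.start()
    card.config_complete.set()                 # MIC signals "Config Complete"
    t.join()
    print(card.ports)                          # released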

[0197] 3. LIC Software

[0198] The LIC (LAN Interface Card) consists of LAN Ethernet ports of the 10/100/1000 Mbps variety. Behind the ports are 4 network engine processors. Each port on a LIC will behave like a layer 2 and layer 3 switch. The functionality of switching and intelligent forwarding is referred to herein as IFF—Intelligent Forwarding and Filtering. The main purpose of the network engine processors is to forward packets based on Layer 2, 3, 4 or 5 information. The ports will look and act like router ports to hosts on the LAN. Only RIP will be supported in the first release, with OSPF to follow.

[0199] 3.1 VLANs

[0200] The box will support port based VLANs. The division of the ports will be based on configuration and initially all ports will belong to the same VLAN. Alternative practices of the invention can include VLAN classification and tagging, including possibly 802.1p and 802.1Q support.

[0201] 3.1.1 Intelligent Filtering and Forwarding (IFF)

[0202] The IFF features are discussed in greater detail below. Layer 2 and layer 3 switching will take place inside the context of IFF. Forwarding table entries are populated by layer 2 and 3 address learning. If an entry is not known the packet is sent to the IP routing layer and it is routed at that level.

[0203] 3.2 Load Balance Data Flow

[0204] NFS load balancing will be supported within a SANStream chassis. Load balancing based upon virtual IP addresses, content and flows are all possible.

[0205] The SANStream box will monitor the health of internal NFS servers that are configured as load balancing servers and will notify network management of detectable issues as well as notify a disk management layer so that recovery may take place. It will, in these cases, stop sending requests to the troubled server, but continue to load balance across the remaining NFS servers in the virtual domain.

[0206] 3.3 LIC—NAS Software

[0207] 3.3.1 Virtual Storage Domains (VSD)

[0208] FIG. 15 provides another VSD example. The switch system of the invention is designed to support, in one embodiment, multiple NFS and CIFS servers in a single device that are exported to the user as a single NFS server (only NFS is supported in the first release). These servers are masked under a single IP address, known as a Virtual Storage Domain (VSD). Each VSD will have one to many connections to the network via a Network Processor (NP) and may also have a pool of Servers (referred to as “Server” throughout this document) connected to the VSD via the fabric on the SRC card.

[0209] Within a virtual domain there are policy domains. These sub-layers define the actions needed to categorize the frame and send it to the next hop in the tree. These policies can define a large range of attributes in a frame and then impose an action (implicit or otherwise). Common policies may include actions based on protocol type (NFS, CIFS, etc.) or source and destination IP or MAC address. Actions may include implicit actions like forwarding the frame on to the next policy for further processing, or explicit actions such as drop.

[0210] FIG. 15 diagrams a hypothetical virtual storage domain owned by Fred's ASP 1502. In this example Fred has the configured address of 1.1.1.1 that is returned by the domain name service when queried for the domain's IP address. The next level of configuration is the policy domain. When a packet arrives into the Pirus box from a router port it is classified as a member of Fred's virtual domain because of its destination IP address. Once the virtual domain has been determined its configuration is loaded in and a policy decision is made based on the configured policy. In the example above let's assume an NFS packet arrived. The packet will be associated with the NFS policy domain and a NAT (network address translation—described below) takes place, with the destination address that of the NFS policy domain. The packet now gets associated with the NFS policy domain for Yahoo. The process continues with the configuration of the NFS policy being loaded in and a decision being made based on the configured policy. In the example above the next decision to be made is whether or not the packet contains the gold, silver, or bronze service. Once that determination is made (let's assume the client was identified as a gold customer), a NAT is performed again to make the destination the IP address of the Gold policy domain. The packet now gets associated with the Gold policy domain. The process continues with the configuration for the Gold policy being loaded in and a decision being made based on the configured policy. At this point a load balancing decision is made to pick the best server to handle the request. Once the server is picked, NAT is again performed and the destination IP address of the server is set in the packet. Once the destination IP address of the packet becomes that of a device configured for load balancing, a switching operation is made and the packet is sent out of the box.

[0211] The implementation of the algorithm above lends itself to recursion and may or may not incur as many NAT steps as described. It is left to the implementer to short cut the number of NATs while maintaining the overall integrity of the algorithm.
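
A recursive rendering of this classification walk, by way of non-limiting example, appears below. The domain tree loosely mirrors the Fred's ASP example above; addresses, domain names and the policy functions are illustrative assumptions, and an implementer could short cut the NAT steps by composing the per-domain decisions.

    # Illustrative sketch: recursive policy classification with NAT at each
    # level; recursion ends when the destination is a concrete server.
    def classify(pkt, domains):
        dst = pkt["dst"]
        if dst not in domains:              # destination is a real server:
            return pkt                      # switch the packet out of the box
        next_hop = domains[dst](pkt)        # policy decision for this domain
        pkt = {**pkt, "dst": next_hop}      # NAT to the chosen next hop
        return classify(pkt, domains)       # recurse into the next domain

    domains = {
        "1.1.1.1":  lambda pkt: "nfs-policy"
                    if pkt["proto"] == "nfs" else "cifs-policy",
        "nfs-policy": lambda pkt: "gold"
                    if pkt["customer"] == "gold" else "bronze",
        "gold":     lambda pkt: "10.0.0.2",  # load balancing picks a server
        "bronze":   lambda pkt: "10.0.0.9",
        "cifs-policy": lambda pkt: "10.0.1.5",
    }

    pkt = {"dst": "1.1.1.1", "proto": "nfs", "customer": "gold"}
    print(classify(pkt, domains)["dst"])     # 10.0.0.2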

[0212] FIG. 15 also presents the concept of port groups 1512, 1516. Port groups are entities that have identical functionality and are members of the same virtual domain. Port group members provide a service. By definition, any member of a particular port group, when presented with a request, must be able to satisfy that request. Port groups may have routers, administrative entities, servers, caches, or other Pirus boxes off of them.

[0213] Virtual Storage Domains can reside across slots but not boxes. More than one Virtual Storage Domain can share a Router Interface.

[0214] 3.3.2 Network Address Translation (NAT)

[0215] NAT translates from one IP Address to another IP Address. The reasons for doing NAT include Load Balancing, securing the identity of each Server from the Internet, reducing the number of IP Addresses purchased, reducing the number of Router ports needed, and the like.

[0216] Each Virtual Domain will have an IP Address that is advertised thru the network NP ports. The IP Address is the address of the Virtual Domain and NOT the NFS/CIFS Server IP Address. The IP Address is translated at the Pirus device in the Virtual Storage Domain to the Server's IP Address. Depending on the Server chosen, the IP Address is translated to the terminating Server IP Address.

[0217] For example, in FIG. 15, IP Address 100.100.100.100 would translate to 1.1.1.1, 1.1.1.2 or 1.1.1.3 depending on the terminating Server.

[0218] 3.3.3 Local Load Balance (LLB)

[0219] Local load balancing defines an operation of balancing between devices (i.e. servers) that are connected directly or indirectly off the ports of a Pirus box without another load balancer getting involved. A lower-complexity implementation would, for example, support only the balancing of storage access protocols that reside in the Pirus box.

[0220] 3.3.3.1 Load Balancing Order of Operations:

[0221] In the process of load balancing configuration it may be possible to define multiple load balancing algorithms for the same set of servers. The need then arises to apply an order of operations to the load balancing methods. They are as follows in the order they are applied:

[0222] 1) Server loading info, Percentage of loading on the server's Ethernet, Percentage of loading on the server's FC port, SLA support, Ratio Weight rating

[0223] 2) Round Trip Time, Response time, Packet Rate, Completion Rate

[0224] 3) Round Robin, Least Connections, Random

[0225] Load balancing methods in the same group are treated with the same weight in determining a server's loading. As the load balancing algorithms are applied, servers that have identical load characteristics (within a certain configured percentage) are moved to the next level in order to get a better determination of what server is best prepared to receive the request. The last load balancing methods that will be applied across the servers that have the identical load characteristics (again within a configured percentage) are round robin, least connections and random.
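
By way of illustration, the following Python sketch applies the order of operations just described: each tier's metrics eliminate servers outside a configured tie band, and round robin breaks any remaining tie. The metric values and the tie-band percentage are illustrative assumptions.

    # Illustrative sketch of tiered load-balancing order of operations.
    import itertools

    def tie_band(scores, pct=0.05):
        """Keep servers within pct of the best (lowest) score."""
        best = min(scores.values())
        return [s for s, v in scores.items() if v <= best * (1 + pct)]

    servers = ["nfs1", "nfs2", "nfs3"]
    tier1 = {"nfs1": 0.40, "nfs2": 0.42, "nfs3": 0.80}  # loading/ratio tier
    tier2 = {"nfs1": 3.0,  "nfs2": 3.1}                 # RTT/response tier
    rr = itertools.cycle(sorted(servers))

    candidates = tie_band(tier1)                  # nfs3 eliminated on loading
    if len(candidates) > 1:
        candidates = tie_band({s: tier2[s] for s in candidates})
    if len(candidates) > 1:                       # still tied: fall through
        choice = next(s for s in rr if s in candidates)   # round robin
    else:
        choice = candidates[0]
    print(choice)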

3.3.3.2 File System Server Load Balance (FSLB):

[0226] The system of the invention is intended to provide load balancing across at least two types of file system servers, NFS and CIFS. NFS is stateless and CIFS is stateful, so there are differences to each method. The goal of file system load balancing is not only to pick the best identical server to handle the request, but to hide the multiple servers transparently behind a single virtual storage domain.

[0227] 3.3.3.3 NFS Server Load Balancing (NLB):

[0228] NFS is mostly stateless and idempotent (every operation returns the same result if it is repeated). This is qualified because operations such as READ are idempotent but operations such as REMOVE are not. Since there is little NFS server state as well as little NFS client state transferred from one server to the other, it is easy for one server to assume the other server's functions. The protocol will allow for a client to switch NFS requests from one server to another transparently. This means that the load balancer can more easily maintain an NFS session if a server fails. For example, if in the middle of a request a server dies, the client will retry, the load balancer will pick another server and the request gets fulfilled (with possibly a file handle NAT), after only a retry. If the server dies between requests, then there isn't even a retry; the load balancer just picks a new server and fulfills the request (with possibly a file handle NAT).

[0229] When using NFS it will be possible for managers to set up the load balancer to balance across multiple NFS servers that have identical data, or managers can set up load balancing to segment the balancing across servers that have unique data. The latter requires virtual domain configuration based on file requested (location in the file system tree) and file type. The former requires a virtual domain and minimal other configuration (i.e. load balancing policy).

[0230] The function of Load Balance Data Flow is to distribute the processing of requests over multiple servers. Load Balance Data Flow is the same as the Traditional Data Flow but the NP statistically determines the load of each server that is part of the specified NFS request and forwards the request based on that server load. The load-balancing algorithm could be as simple as round robin or a more sophisticated administrator-configured policy.

[0231] Server load balance decisions are made based upon IP destination address. For any server IP address, a routing NP may have a table of configured alternate server IP addresses that can process an HTTP transaction. Thus multiple redundant NFS servers are supported using this feature.

[0232] TCP based server load balance decisions are made within the NP on a per connection basis. Once a server is selected through the balancing algorithm all transactions on a persistent TCP connection will be made to the same originally targeted server. An incoming IP message's source IP address and IP source Port number are the only connection lookup keys used by an NP.

[0233] For example, suppose a URL request arrives for 192.32.1.1. The Router NP processor's lookup determines that server 192.32.1.1 is part of a Server Group (192.32.1.1, 192.32.1.2, etc.). The NP decides which Server Group member to forward the request to via a user-configured algorithm. Round-Robin, estimated actual load, and current connection count are all candidates for selection algorithms. If TCP is the transport protocol, the TCP session is then terminated at the specified SRC processor.
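
A minimal sketch of this per-connection stickiness, under the assumption of a round-robin selection algorithm, appears below; the Server Group data is taken from the example above, and the table names are hypothetical.

    # Illustrative sketch: sticky per-connection TCP server selection keyed
    # only by (source IP, source port), as described above.
    import itertools

    server_group = {"192.32.1.1": ["192.32.1.1", "192.32.1.2"]}
    rr = {vip: itertools.cycle(members)
          for vip, members in server_group.items()}
    connections = {}   # (src_ip, src_port) -> chosen server

    def forward(src_ip, src_port, dst_ip):
        key = (src_ip, src_port)
        if key not in connections:              # new connection (TCP SYN):
            connections[key] = next(rr[dst_ip]) # round-robin balance decision
        return connections[key]                 # sticky for the connection

    print(forward("10.1.1.7", 40001, "192.32.1.1"))  # decision made here
    print(forward("10.1.1.7", 40001, "192.32.1.1"))  # same server, no re-balance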

[0234] UDP protocols do not have an opening SYN exchange that must be absorbed and spoofed by the load balancing IXP. Instead each UDP packet can be viewed as a candidate for balancing. This is both good and bad. The lack of an opening SYN simplifies part of the balance operation, but the effort of balancing each packet could add considerable latency to UDP transactions.

[0235] In some cases it will be best to make an initial balance decision and keep a flow mapped for a user-configurable time period. Once the period has expired an updated balance decision can be made in the background and a new balanced NFS server target selected.

[0236] In many cases it will be most efficient to re-balance a flow during a relatively idle period. Many disk transactions result in forward looking actions on the server (people who read the 1st half of a file often want the 2nd half soon afterwards) and rebalancing during active disk transactions could actually hurt performance.

[0237] An amendment to the “time period” based flow balancing described above would be to arm the timer for an inactivity period and re-arm it whenever NFS client requests are received. A longer inactivity timer period could be used to determine when a flow should be deleted entirely rather than re-balanced.
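
The following Python sketch models this amendment: a flow keeps its balanced target while requests keep arriving, is re-balanced after a shorter inactivity period, and is deleted after a longer one. The threshold values and structure names are illustrative assumptions.

    # Illustrative sketch: inactivity-timer based flow re-balancing.
    import time

    REBALANCE_AFTER = 30.0    # seconds of inactivity before re-balancing
    DELETE_AFTER = 300.0      # seconds of inactivity before deleting the flow
    flows = {}                # flow key -> {"server": ..., "last": timestamp}

    def on_request(key, pick_server):
        now = time.monotonic()
        flow = flows.get(key)
        if flow is None or now - flow["last"] > DELETE_AFTER:
            flows[key] = {"server": pick_server(), "last": now}  # fresh flow
        elif now - flow["last"] > REBALANCE_AFTER:
            flows[key]["server"] = pick_server()  # re-balance while idle
            flows[key]["last"] = now
        else:
            flows[key]["last"] = now              # re-arm the inactivity timer
        return flows[key]["server"]

    print(on_request(("10.0.0.5", 2049), lambda: "nfs2"))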

[0238] 3.3.3.4 TCP and UDP—Methods of balancing:

[0239] NFS can run over both TCP and UDP (UDP being more prevalent). When processing UDP NFS requests the method used for pseudo-proxy of TCP sessions does not need to be employed. During a UDP session, the information to make a rational load balancing decision can be made with the first packet.

[0240] Several methods of load balancing are possible. The first and simplest to implement is load balancing based on source address—all requests are sent to the same server for a set period of time after a load balancing decision is made to pick the best server at the UDP request or the TCP SYN.

[0241] Another method is to load balance every request with no regard for the previous server the client was directed to. This will possibly require obtaining a new file handle from the new server and NATing so as to hide the file handle change from the client. This method also carries with it more overhead in processing (every request is load balanced) and more implementation effort, but does give a more balanced approach.

[0242] Yet another method for balancing NFS requests is to cache a “next balance” target based on previous experience. This avoids the overhead of extensive balance decision making in real time, and has the benefit of more even client load distribution.

[0243] In order to reduce the processing of file handle differences between identical internal NFS servers, all disk modify operations will be strictly ordered. This will ensure that the inode numbering is consistent across all identical disks.

[0244] Among the load balancing methods that can be used (others are possible) are:

[0245] Round Robin

[0246] Least Connections

[0247] Random (lower IP-bits, hashing)

[0248] Packet Rate (minimum throughput)

[0249] Ratio Weight rating

[0250] Server loading info and health as well as application health

[0251] Round Trip Time (TCP echo)

[0252] Response time

[0253] 3.3.3.5 Write Replication:

[0254] NFS client read and status transactions can be freely balancedacross a VLAN family of peer NFS servers. Any requests that result indisk content modification (file create, delete, set-attributes, datawrite, etc.) must be replicated to all NFS servers in a VLAN server peergroup.

[0255] The Pirus Networks switch fabric interface (SFI) will be used to multicast NFS modifications to all NFS servers in a VLAN balancing peer group. All NFS client requests generate server replies and have a unique transaction ID. This innate characteristic of NFS can be used to verify and confirm the success of multicast requests.

[0256] At least two mechanisms can be used for replicated transaction confirmation. They are "first answer" and quorum. Using the "first answer" algorithm, an IXP would keep minimal state for an outstanding NFS request, and return the first response it receives back to the client. The quorum system would require the IXP to wait for some percentage of the NFS peer servers to respond with identical messages before returning one to the client.
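The following C sketch contrasts the two confirmation policies. The structure, the peer count, and the quorum fraction are assumptions for illustration; the specification fixes none of them.

    /* Sketch of "first answer" vs. quorum reply confirmation. */
    #include <stdio.h>

    #define NPEERS        4
    #define QUORUM_NEEDED 3     /* assumed: 3 of 4 identical replies */

    struct pending {
        unsigned xid;           /* NFS transaction ID of the multicast request */
        int replies;            /* identical replies seen so far */
        int answered;           /* set once a reply has gone back to the client */
    };

    /* Returns 1 if this reply should be forwarded to the client. */
    int on_reply_first_answer(struct pending *p)
    {
        if (p->answered) return 0;
        p->answered = 1;
        return 1;               /* first response wins */
    }

    int on_reply_quorum(struct pending *p)
    {
        if (p->answered) return 0;
        if (++p->replies >= QUORUM_NEEDED) { p->answered = 1; return 1; }
        return 0;
    }

    int main(void)
    {
        struct pending p = { 0x1234, 0, 0 };
        for (int i = 0; i < NPEERS; i++)
            if (on_reply_quorum(&p)) printf("reply %d sent to client\n", i + 1);
        return 0;
    }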

[0257] Using either method, unresponsive NFS servers are removed from the VLAN peer balancing group. When a server is removed from the group, the Pirus NFS mirroring service must be notified so that recovery procedures can be initiated.

[0258] A method for coordinating NFS write replication is set forth in FIG. 16, including the following steps: check for NFS replication packet 1602; if yes, multicast packet to entire VLAN NFS server peer group 1604; wait for 1st NFS server reply with timeout 1608; send 1st server reply to client 1610; remove unresponsive servers from LB group and inform NFS mirroring service 1610. If not an NFS replication packet, load balance and unicast to NFS server 1606.
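A minimal C rendering of the FIG. 16 decision appears below; the predicate and the send routines are assumed stubs standing in for the IXP's real classification and fabric-multicast operations.

    /* Sketch of the FIG. 16 dispatch: modify operations are multicast to
       the VLAN peer group, everything else is load balanced and unicast. */
    #include <stdio.h>

    int  is_replication_packet(int op) { return op == 1; }   /* create/write/remove */
    void multicast_to_peer_group(int op)    { printf("multicast op %d to peer group\n", op); }
    void unicast_to_balanced_server(int op) { printf("unicast op %d to server\n", op); }

    void dispatch(int op)
    {
        if (is_replication_packet(op)) {
            multicast_to_peer_group(op);
            /* then: wait for 1st NFS server reply with timeout, forward it to
               the client, and drop unresponsive servers from the LB group */
        } else {
            unicast_to_balanced_server(op);
        }
    }

    int main(void) { dispatch(1); dispatch(0); return 0; }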

[0259] 3.3.4 Load Balancer Failure Indication:

[0260] When a load balancer declares that a peer NFS server is being dropped from the group, the NFS mirroring service is notified. A determination must be made as to whether the disk failure was soft or hard.

[0261] In the case of a soft failure, a hot synchronization should be attempted to bring the failing NFS server back online. All NFS modify transactions must be recorded for playback to the failing NFS server when it returns to service.

[0262] When a hard failure has occurred, an administrator must be notified and a fresh disk will be brought online, formatted, and synchronized.

[0263] 3.3.4.1 CIFS Server Load Balancing:

[0264] CIFS is stateful and as such there are fewer options available for load balancing. CIFS is a session-oriented protocol; a client is required to log on to a server using simple password authentication or a more secure cryptographic challenge. CIFS supports no recovery guarantees if the session is terminated through server or network outage. Therefore load balancing of CIFS requests must be done once, at TCP SYN, and persistence must be maintained throughout the session. If a disk fails and not the CIFS server, then a recovery mechanism can be employed to transfer state from one server to another and maintain the session. However, if the server fails (hardware or software) and there is no way to transfer state from the failed server to the new server, then the TCP session must be brought down and the client must reestablish a new connection with a new server. This means logging in again and recreating state on the new server.

[0265] Since CIFS is TCP based, the balancing decision will be made at the TCP SYN. Since the TCP session will be terminated at the destination server, that server must be able to handle all requests that the client believes exist under that domain. Therefore all CIFS servers that are masked by a single virtual domain must have identical content on them. Secondly, data that spans an NFS server file system must be represented as a separate virtual domain and accessed by the client as another CIFS server (i.e. another mount point).

[0266] Load balancing will support source address based persistence and will send all requests to the same server until an inactivity timeout expires. Load balancing methods used will be:

[0267] Round Robin

[0268] Least Connections

[0269] Random (lower IP-bits, hashing)

[0270] Packet Rate (minimum throughput)

[0271] Ratio Weight rating

[0272] Server loading info and health as well as application health

[0273] Round Trip Time (TCP echo)

[0274] Response time

[0275] 3.3.4.2 Content Load Balance:

[0276] Content load balancing is achieved by delving deeper into packet contents than the simple destination IP address.

[0277] Through configuration and policy it will be possible to re-target NFS transactions to specific servers based upon NFS header information. For example, a configuration policy may state that all files under a certain directory are load balanced between the two specified NFS servers.

[0278] A hierarchy of load balancing rules may be established when Server Load Balancing is configured subordinate to Content Load Balancing.
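A minimal sketch of such a content rule table follows. The directory prefixes, server addresses, and structure names are invented for illustration; an unmatched request falls through to ordinary server load balancing.

    /* Sketch of content-based re-targeting by directory prefix. */
    #include <stdio.h>
    #include <string.h>

    struct content_rule {
        const char *dir_prefix;     /* directory configured by policy */
        const char *servers[2];     /* NFS servers balanced for that directory */
    };

    static struct content_rule rules[] = {
        { "/export/db",  { "10.0.0.1", "10.0.0.2" } },   /* assumed example rule */
        { "/export/web", { "10.0.0.3", "10.0.0.4" } },   /* assumed example rule */
    };

    const char *route(const char *path, int rr)
    {
        for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
            if (strncmp(path, rules[i].dir_prefix, strlen(rules[i].dir_prefix)) == 0)
                return rules[i].servers[rr & 1];   /* round robin within the rule */
        return NULL;                               /* fall through to server LB */
    }

    int main(void)
    {
        printf("%s\n", route("/export/db/payroll", 0));
        return 0;
    }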

[0279] 3.4 LIC—SCSI/IP Software

[0280] 3.5 Network Processor Functionality

[0281] FIG. 17 is a top-level block diagram of the software on an NP. Note that the implementation of a block may be split across the policy processor and the micro-engines. Note also that not all blocks may be present on all NPs. The white blocks are common (in concept and to some level of implementation) between all NPs; the lightly shaded blocks are present on NPs that have load balancing and storage server health checking enabled on them.

[0282] 3.5.1 Flow Control

[0283] 3.5.1.1 Flow Definition:

[0284] Flows are defined by source port, destination port, and source and destination IP address. Packets are tagged coming into the box and classified by protocol, destination port and destination IP address. Then, based on policy and/or the TOS bit, a priority is assigned within the class. Classes are associated with a priority when compared to other classes. Within the same class, priorities are assigned to packets based on the TOS bit setting and/or policy.
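The following C sketch renders this flow definition and two-level classification. The specific class and priority rules shown are placeholder assumptions, since the actual classes and policies are configurable.

    /* Sketch of the 4-tuple flow key and class/priority assignment. */
    #include <stdint.h>
    #include <stdio.h>

    struct flow_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    struct classified_pkt {
        struct flow_key key;
        uint8_t cls;         /* class: from protocol + dest port + dest IP */
        uint8_t prio;        /* priority within class: from TOS and policy */
    };

    void classify(struct classified_pkt *p, uint8_t proto, uint8_t tos)
    {
        p->cls  = (proto == 6 /* TCP */) ? 1 : 0;   /* placeholder class policy */
        p->prio = tos >> 5;                         /* IP precedence bits of TOS */
    }

    int main(void)
    {
        struct classified_pkt p = { { 0x0a000001, 0x0a000002, 1023, 2049 }, 0, 0 };
        classify(&p, 17 /* UDP */, 0xE0);
        printf("class=%u prio=%u\n", p.cls, p.prio);
        return 0;
    }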

[0285] 3.5.1.2 Flow Control Model:

[0286] Flow control will be provided within the SANStream product to the extent described in this section. Each of the egress Network Processors will perform flow control. There will be a queue High Watermark that, when approached, will cause flow control indications from the egress Network Processor to offending Network Processors based on QoS policy. The offending Network Processor will narrow TCP windows (when present) to reduce traffic flow volumes. If an egress Network Processor exceeds a Hard Limit (something higher than the High Watermark), the egress Network Processor will perform intelligent dropping of packets based on class priority and policy. As the situation improves and the Low Watermark is approached, egress control messages back to the offending network processors allow for resumption of normal TCP window sizes.
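A minimal sketch of the watermark scheme is shown below; the threshold values and function names are assumptions, and the printed messages stand in for the egress control messages described above.

    /* Sketch of High Watermark / Hard Limit / Low Watermark flow control. */
    #include <stdio.h>

    #define LOW_WM    100    /* assumed thresholds */
    #define HIGH_WM   800
    #define HARD_LIM 1000

    typedef enum { NORMAL, THROTTLED } fc_state;

    fc_state egress_check(int qdepth, fc_state st, int *drop)
    {
        *drop = (qdepth >= HARD_LIM);          /* drop by class priority/policy */
        if (qdepth >= HIGH_WM && st == NORMAL) {
            printf("send flow-control ON to offending NPs\n");
            return THROTTLED;                  /* offenders narrow TCP windows */
        }
        if (qdepth <= LOW_WM && st == THROTTLED) {
            printf("send flow-control OFF, resume normal TCP windows\n");
            return NORMAL;
        }
        return st;
    }

    int main(void)
    {
        int drop;
        fc_state st = NORMAL;
        st = egress_check(850, st, &drop);     /* crosses High Watermark */
        st = egress_check(50,  st, &drop);     /* falls below Low Watermark */
        return 0;
    }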

[0287] For example, in FIG. 18, the egress Network Processor is NP1 1802 and the offending Network Processors are NP2 1804 and NP4 1808. NP2 and NP4 were determined to be offending NPs based on the High Watermark and each of their policies. NP1, detecting the offending NPs, sends flow control messages to each of the processors. These offending processors should perform flow control as described previously. If the Hard Limit is reached in NP1, then packets received by NP2 or NP4 can be dropped intelligently (in a manner that can be determined by the implementer).

[0288] 3.5.2 Flow Thru Vs. Buffering

[0289] There will be a distinct differentiation in performance between the flow-thru and the other, slower paths of processing.

[0290] 3.5.2.1 Flow Thru:

[0291] Fast path processing will be defined as flow-thru. This path will not include buffering. Packets in this path must be designated as flow-thru within the first N bytes (current thinking is M ports for the IXP-1200). These types of packets will be forwarded directly to the destination processor to then be forwarded out of the box. Packets that are eligible for flow-thru include flows that have an IFF table entry, Layer 2 switchable packets, packets from the servers to clients, and FC switchable frames.

[0292] 3.5.2.2 Buffering:

[0293] Packets that require further processing will need to be buffered and will take one of two paths.

[0294] Buffered Fast Path

[0295] The first buffered path is taken on packets that require looking further into the frame. These frames will need to be buffered in order that more of the packet can be loaded into a micro-engine for processing. These include deep processing of layer 4-7 headers, load balancing and QoS processing.

[0296] Slow Path

[0297] The second buffered path occurs when, during processing in a micro-engine, a determination is made that more processing needs to occur than can be done in a micro-engine. These packets require buffering and will be passed to the NP co-processor in that form. When this condition has been detected, the goal will be to process as much as possible in the micro-engine before handing it up to the co-processor. This will take advantage of the performance that is inherent in a micro-engine design.

[0298] 4. SRC NAS

[0299] The Pirus Networks 1st generation Storage Resource Card (SRC) is implemented with four instances of a high performance embedded computing kernel. A single instance of this kernel can contain the components shown in FIG. 19.

[0300] Software Features:

[0301] The SRC Phase 1 NAS software load will provide NFS server capability. Key requirements include:

[0302] High performance—no software copies on read data, caching

[0303] High availability—balancing, mirroring

[0304] 4.1 SRC NAS Storage Features

[0305] 4.1.1 Volume Manager

[0306] A preferred practice of the Pirus Volume Manager provides support for crash recovery and resynchronization after failure. This module will interact with the NFS mirroring service during resynchronization periods. Disk Mirroring (RAID-1), hot sparing, and striping (RAID-0) are also supported.

[0307] 4.1.2 Disk Cache

[0308] Tightly coupled with the Volume Manager 2002, a Disk Cache module 2004 will utilize the large pool of buffer RAM to eliminate redundant disk accesses. Object based caching (rather than page-based) can be utilized. Disk Cache replacement algorithms can be dynamically tuned based upon perceived role. Database operations (frequent writes) will benefit from a different cache model than html serving (frequent reads).

[0309] 4.1.3 SCSI

[0310] Initiator mode support is required in phase 1. This layer will be tightly coupled with the Fibre Channel controller device. Implementers will wish to verify the interoperability of this protocol with several current generation drives (IBM, Seagate), JBODs, and disk arrays.

[0311] 4.1.4 Fibre Channel

[0312] The disclosed system will provide support for fabric node (N_PORT) and arbitrated loop (NL_PORT). The Fibre Channel interface device will provide support for SCSI initiator operations, with interoperability of this interface with current generation FC Fabric switches (such as those from Brocade, Ancor). Point-to-Point mode can also be supported; and it is understood that the device will perform master mode DMA to minimize processor intervention. It is also to be understood that the invention will interface and provide support to systems using NFS, RPC (Remote Procedure Call), MNT, PCNFSD, NLM, MAP and other protocols.

[0313] 4.1.5 Switch Fabric Interface

[0314] A suitable switch fabric interface device driver is left to the implementer. Chained DMA can be used to minimize CPU overhead.

[0315] 4.2 NAS Pirus System Features

[0316] 4.2.1 Configuration/Statistics

[0317] The expected complement of parameters and information will be available through management interaction with the Pirus chassis MIC controller.

[0318] 4.2.2 NFS Load Balancing

[0319] The load balancing services of the LIC are also used to balance requests across multiple identical NFS servers within the Pirus chassis. NFS data read balancing is a straightforward extension to planned services when Pirus NFS servers are hidden behind a NAT barrier.

[0320] With regard to NFS data write balancing, when a LIC receives NFS create, write, or remove commands, they must be multicast to all participating NFS SRC servers that are members of the load balancing group.

[0321] 4.2.3 NFS Mirroring Service

[0322] The NFS mirroring service is responsible for maintaining the integrity of replicated NFS servers within the Pirus chassis. It coordinates the initial mirrored status of peer NFS servers upon user configuration. This service also takes action when a load-balancer notifies it that a peer NFS server has fallen out of the group or when a new disk "checks in" to the chassis.

[0323] This service interacts with individual SRC Volume Manager modules to synchronize file system contents. It could run on a #9 processor associated with any SRC module or on the MIC.

[0324] 5. SRC Mediation

[0325] Storage Mediation is the technology of bridging between storage media of different types. We will mediate between Fibre Channel targets and initiators and IP-based targets and initiators. The disclosed embodiment will support numerous mediation techniques.

[0326] 5.1 Supported Mediation Protocols

[0327] Mediation protocols that can be supported by the disclosed architecture will include Cisco's SCSI/TCP, Adaptec's SEP protocol, and the standard canonical SCSI/UDP encapsulation.

[0328] 5.1.1 SCSI/UDP

[0329] SCSI/UDP has not been documented as a supported encapsulation technique by any hardware manufacturer. However, UDP has some speed advantages when compared to TCP. UDP, however, is not a reliable transport. Therefore it is proposed that we use SCSI/UDP to extend the Fibre Channel fabric through our own internal fabric (see FIG. 21, demonstrating SCSI/UDP operation with elements 100, IBM 2102 and Disk Array 2104). The benefit of UDP is lower processing and latency. Reliable UDP (a Cisco protocol) may also be used in the future if we want to extend the protocol to the LAN or the WAN.

[0330] 5.2 Storage Components

[0331] The following discussion refers to FIG. 22, which depicts software components for storage (2202 et seq.).

[0332] 5.2.1 SCSI/IP Layer:

[0333] The SCSI/IP layer is a full TCP/IP stack and application software dedicated to the mediation protocols. This is the layer that will initiate and terminate SCSI/IP requests for initiators and targets, respectively.

[0334] 5.2.2 SCSI Mediator:

[0335] The SCSI mediator acts as a SCSI server to incoming IP payload.

[0336] This thin module maps between IP addresses and SCSI devices and LUNs.

[0337] 5.2.3 Volume Manager

[0338] The Pirus Volume Manager will provide support for disk formatting, mirroring (RAID-1) and hot spare synchronization. Striping (RAID-0) may also be available in the first release. The VM must be bulletproof in the HA environment. NVRAM can be utilized to increase performance by committing writes before they are actually delivered to disk.

[0339] When the Volume Manager is enabled, a logical volume view is presented to the SCSI mediator as a set of targetable LUNs. These logical volumes do not necessarily correspond to physical SCSI devices and LUNs.

[0340] 5.2.4 SCSI Originator

[0341] In the disclosed architecture this layer will be tightly coupled with the Fibre Channel controller device, with interoperability of this protocol with several current generation drives (IBM, Seagate), JBODs, and disk arrays. This module can be identical to its counterpart in the SRC NAS image.

[0342] 5.2.5 SCSI Target

[0343] SCSI target mode support will be required if external FC hosts are permitted to indirectly access remote SCSI disks via mediation (e.g.

[0344] SCSI/FC→SCSI/FC via SCSI/TCP).

[0345] 5.2.6 Fibre Channel

[0346] In the disclosed embodiments, support will be provided for fabric node (N_PORT) and arbitrated loop (NL_PORT). The Fibre Channel interface device will provide support for SCSI initiator or target operations. Interoperability of this interface with current generation FC Fabric switches (Brocade, Ancor) must be assured. Point-to-Point mode must also be supported. This module should be identical to its counterpart in the SRC NAS image.

[0347] 5.3 Mediation Example

[0348] FIG. 23 depicts an FC originator communicating with an FC Target (elements 2302 et seq.), as follows:

[0349] ORIGINATOR~ sends a SCSI Read Command to TARGET^:

[0350] 1. Each Originator/Target pair completes their LIP Sequence. Each 750 is notified of the existence of Originator~/Target^.

[0351] 2. 750~ generates an IP command that tells IXP~ to make a connection to IXP^.

[0352] 3. 750^ generates an IP command to tell IXP^ to make Target^ "visible" over IP.

[0353] 4. Originator~ issues a SCSI READ CDB to Target~. Target~ sends the CDB to 750~.

[0354] 5. 750~ builds a SCSI/IP request with the CDB and issues it to IXP~.

[0355] 6. IXP~ sends the packet to IXP^.

[0356] 7. IXP^ sends the IP packet to 750^.

[0357] 8. 750^ removes the SCSI CDB from the IP packet and issues a SCSI CDB request to Originator^ (memory for the READ COMMAND has been allocated).

[0358] 9. Originator^ issues FCP_CMND to Target^.

[0359] 10. When the command is complete, Target^ sends FCP_RSP to Originator^. Originator^ notifies 750^ with good status.

[0360] 11. 750^ packages data and status into IP packets and sends them to IXP^.

[0361] 12. IXP^ sends the data and status to IXP~.

[0362] 13. IXP~ sends the IP packets with data and status to 750~.

[0363] 14. 750~ allocates buffer space, dumps the data into buffers and requests Target~ to send the data and response to Originator~.

III. NFS Load Balancing

[0364] An object of load balancing is that several individual servers are made to appear as a single, virtual server to clients. An overview is provided in FIG. 24, including elements 2402 et seq. In particular, the client makes file system requests to a virtual server. These requests are then directed to one of the servers that make up the virtual server. The file system requests can be broken into two categories:

[0365] 1) reads, or those requests that do not modify the file system; and

[0366] 2) writes, or those requests that do change the file system.

[0367] Read requests do not change the file system and thus can be sent to any of the individual servers that make up the virtual server. Which server a request is sent to is determined by one of several possible load balancing algorithms. This spreads the requests across several servers, resulting in an improvement in performance over a single server. In addition, it allows the performance of a virtual server to be scaled simply by adding more physical servers.

[0368] Some of the possible load balancing algorithms are listed below; a sketch of the weighted-access variant follows the list:

[0369] 1. Round Robin, where each request is sent sequentially to the next server.

[0370] 2. Weighted access, where requests are sent to servers based on a percentage formula, e.g. 15% of the requests go to server A, 35% to server B, and 50% to server C. These weighting factors can be fixed, or be dynamic based on such factors as server response time.

[0371] 3. File handle, where requests for files that have been accessed previously are directed back to the server that originally satisfied the request. This increases performance by increasing the likelihood that the file will be found in the server's cache.
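Here is a minimal C sketch of the weighted-access algorithm of item 2; the 15/35/50 split mirrors the example above, while the function names and the use of rand() are assumptions for illustration.

    /* Sketch of weighted-access server selection. */
    #include <stdio.h>
    #include <stdlib.h>

    static const int weight[] = { 15, 35, 50 };   /* must sum to 100 */

    int pick_weighted(void)
    {
        int r = rand() % 100, acc = 0;
        for (int s = 0; s < 3; s++) {
            acc += weight[s];
            if (r < acc) return s;
        }
        return 2;          /* unreachable when the weights sum to 100 */
    }

    int main(void)
    {
        int hits[3] = { 0 };
        for (int i = 0; i < 100000; i++) hits[pick_weighted()]++;
        printf("A=%d B=%d C=%d\n", hits[0], hits[1], hits[2]);
        return 0;
    }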

[0372] Write requests are different from read requests in that they must be broadcast to each of the individual servers so that the file systems on each server stay in sync. Thus, each write request generates several responses, one from each of the individual servers. However, only one response is sent back to the client.

[0373] An important way to improve performance is to return to the client the first positive response from any of the servers instead of waiting for all the server responses to be received. This means the client sees the fastest server response instead of the slowest. A problem can arise if all the servers do not send the same response, for example if one of the servers fails to do the write while all the others are successful. This results in the servers' file systems becoming unsynchronized. In order to catch and fix unsynchronized file systems, each outstanding write request must be remembered and the responses from each of the servers tracked.

[0374] The file handle load balancing algorithm works well for directing requests for a particular file to a particular server. This increases the likelihood that the file will be found in the server's cache, resulting in a corresponding increase in performance over the case where the server has to go out to a disk. It also has the benefit of preventing a single file from being cached on two different servers, which uses the servers' caches more efficiently and allows more files to be cached. The algorithm can be extended to cover the case where a file is being read by many clients and the rate at which it is served to these clients could be improved by having more than one server serve this file. Initially a file's access will be directed to a single server. If the rate at which the file is being accessed exceeds a certain threshold, another server can be added to the list of servers that handle this file. Successive requests for this file can be handled in a round robin fashion between the servers set up to handle the file. Presumably the file will end up in the caches of both servers. This algorithm can handle an arbitrary number of servers handling a single file.

[0375] The following discussion describes methods and apparatus for providing NFS server load balancing in a system utilizing the Pirus box, and focuses on the process of how to balance file reads across several servers.

[0376] As illustrated in FIG. 24, NFS load balancing is done so that multiple NFS servers can be viewed as a single server. An NFS client issuing an NFS request does so to a single NFS IP address. These requests are captured by the NFS load balancing functionality and directed toward specific NFS servers. The determination of which server to send the request to is based on two criteria: the load on the server and whether the server already has the file in cache.

[0377] The terms "SA" (the general purpose StrongArm processor that resides inside an IXP) and "Micro-engine" (the micro-coded processor in the IXP; in one embodiment of the invention, there are 6 in each IXP) are used herein.

[0378] As shown in the accompanying diagrams and specification, the invention utilizes "workload distribution" methods in conjunction with a multiplicity of NFS (or other protocol) servers. Among these methods (generically referred to herein as "load balancing") are methods of "server load balancing" and "content aware switching".

[0379] A preferred practice of the invention combines both "Load Balancing" and "Content Aware Switching" methods to distribute workload within a file server system. A primary goal of this invention is to provide scalable performance by adding processing units, while "hiding" this increased system complexity from outside users.

[0380] The two methods used to distribute workload have different but complementary characteristics. Both rely on the common method of examining or interpreting the contents of incoming requests, and then making a workload distribution decision based on the results of that examination.

[0381] Content Aware Switching presumes that the multiplicity of servers handle different contents; for example, different subdirectory trees of a common file system. In this mode of operation, the workload distribution method would be to pass requests for (e.g.) "subdirectory A" to one server, and "subdirectory B" to another. This method provides a fair distribution of workload among servers, given a statistically large population of independent requests, but cannot provide enhanced response to a large number of simultaneous requests for a small set of files residing on a single server.

[0382] Server Load Balancing presumes that the multiplicity of servers handle similar content; for example, different RAID 1 replications of the same file system. In this mode of operation, the workload distribution method would be to select one of the set of available servers, based on criteria such as the load on the server, its availability, and whether it has the requested file in cache. This method provides a fair distribution of workload among servers when there are many simultaneous requests for a relatively small set of files.

[0383] These two methods may be combined, with content aware switching selecting among sets of servers, within which load balancing is performed to direct traffic to individual servers. As a separate invention, the content of the servers may be dynamically changed, for example by creating additional copies of commonly requested files, to provide additional server capacity transparently to the user.

[0384] As shown in the accompanying diagrams and specification, one element of the invention is the use of multiple computational elements, e.g. Network Processors and/or Storage CPUs, interconnected with a high speed connection network, such as a packet switch, crossbar switch, or shared memory system. The resultant tight, low latency coupling facilitates the passing of necessary state information between the traffic distribution method and the file server method.

[0385] 1. Operation

[0386] 1.1 Read Requests

[0387] Referring now to FIGS. 25 and 26, the following is the sequence of events that occurs in one embodiment of the invention when an NFS READ (could also include other requests like LOOKUP) request is received.

[0388] 1. A Micro-engine receives a packet on one of its ports from an NFS client that contains a READ request to the NFS domain.

[0389] 2. The Micro-engine uses the file handle contained in the request to perform a lookup in a file handle hash table.

[0390] 3. The hash lookup results in a pointer to a file handle entry (we'll assume a hit for now).

[0391] 4. In the hash table is the IP address for the specific NFS server the request should be directed to. Presumably this NFS server should have the file in its cache and thus be able to serve it up more quickly than one that does not.

[0392] 5. The destination IP address of the packet with the READ request is updated with the server IP address and the packet is then forwarded to the server.

[0393] A hash table entry can have more than one NFS server IP address. This allows a file that is under heavy access to exist in more than one NFS server cache and thus to be served up by more than one server. The selection of which specific server to direct a specific READ request to can be determined by the implementer, but could be as simple as round robin.
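The following C sketch illustrates steps 2-5 together with the multi-server entries just described; the table size, entry layout, and hash function are assumptions, and collision chaining is omitted for brevity.

    /* Sketch of the file handle hash lookup with round robin over
       multiple cached servers. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define HASH_SIZE 256
    #define MAX_SRV     4

    struct fh_entry {
        uint8_t  fh[32];                /* NFS file handle */
        uint32_t server_ip[MAX_SRV];    /* servers caching this file */
        int      nservers, rr;
    };

    static struct fh_entry table[HASH_SIZE];

    static unsigned fh_hash(const uint8_t *fh)
    {
        unsigned h = 0;
        for (int i = 0; i < 32; i++) h = h * 31 + fh[i];
        return h % HASH_SIZE;
    }

    /* Returns the server IP the READ should be redirected to, 0 on miss. */
    uint32_t lookup(const uint8_t *fh)
    {
        struct fh_entry *e = &table[fh_hash(fh)];
        if (e->nservers == 0 || memcmp(e->fh, fh, 32) != 0)
            return 0;                   /* miss: fall back to load balancer */
        uint32_t ip = e->server_ip[e->rr];
        e->rr = (e->rr + 1) % e->nservers;
        return ip;
    }

    int main(void)
    {
        uint8_t fh[32] = { 1 };
        struct fh_entry *e = &table[fh_hash(fh)];
        memcpy(e->fh, fh, 32);
        e->server_ip[0] = 0x0a000001; e->server_ip[1] = 0x0a000002;
        e->nservers = 2;
        printf("%08x %08x\n", (unsigned)lookup(fh), (unsigned)lookup(fh));
        return 0;
    }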

[0394] 1.2 Determining the Number of Servers for a File

[0395] The desired behavior is that:

[0396] 1. Files that are lightly accessed, i.e. have a low number of accesses per second, only need to be served by a single server.

[0397] 2. Files that are heavily accessed are served by more than one server.

[0398] 3. Accesses to a file are directed to the same server, or set of servers if it is being heavily accessed, to keep accesses directed to those servers that have that file in their caches.

[0399] 1.3 Server Lists

[0400] In addition to being able to be looked up using the file handle hash table, file handle entries can be placed on doubly linked lists. There can be a number of such linked lists. Each list has on it the file handle entries that have a specific number of servers serving them. There is a list for file handle entries that have only one server serving them. Thus, as shown in FIG. 27, for example, there might be a total of three lists: a single server list, a two-server list and a four-server list. The single server list has entries in it that are being served by one server, the two-server list is a list of the entries being served by two servers, etc.

[0401] File handle entries are moved from list to list as the frequency of access increases or decreases.

[0402] 1.3.1 Single Server List

[0403] All the file handle entries begin on the single server list. When a READ request is received, the file handle in the READ is used to access the hash table. If there is no entry for that file handle, a free entry is taken from the entry free list and a single server is selected to serve the file, by some criteria such as least loaded, fastest responding or round robin. If no entries are free then a server is selected and the request is sent directly to it without an entry being filled out. Once a new entry is filled out it is added to the hash table and placed at the top of the single server list queue.

[0404] Periodically, a process checks the free list and, if it is close to empty, it will take some number of entries off the bottom of the single server list, remove them from the hash table and then place them back on the free list. This keeps the free list replenished.

[0405] Since entries are placed on the top of the list and taken off from the bottom, each entry spends a certain amount of time on the list, which varies according to the rate at which new file handle READ requests occur. During the period of time that an entry exists on the list it has the opportunity to be hit by another READ access. Each time a hit occurs a counter is bumped in the entry. If an entry receives enough hits while it is on the list to exceed a pre-defined threshold, it is deemed to have enough activity to deserve to have more servers serving it. Such an entry is then taken off the single server list, additional servers are selected to serve the file, and the entry is then placed on one of the multiple server lists.
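A minimal sketch of this hit-counting promotion logic follows; the threshold constant and structure layout are assumptions.

    /* Sketch of promotion from the single server list. */
    #include <stdio.h>

    #define PROMOTE_HITS 16     /* assumed pre-defined threshold */

    struct entry { int hits; int nservers; };

    /* Called on each READ that hits this entry. Returns 1 on promotion. */
    int on_hit(struct entry *e)
    {
        if (++e->hits > PROMOTE_HITS && e->nservers == 1) {
            e->nservers = 2;    /* SA adds servers and moves entry to 2-server list */
            e->hits = 0;
            return 1;
        }
        return 0;
    }

    int main(void)
    {
        struct entry e = { 0, 1 };
        for (int i = 0; i < 20; i++)
            if (on_hit(&e)) printf("promoted at hit %d\n", i + 1);
        return 0;
    }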

[0406] In the illustrated embodiment of the invention, it is expected that the micro-engines will handle the lookup and forwarding of requests to the servers, and that the SA will handle all the entry movements between lists and adding and removing them from the hash table. However, other distributions of labor can be utilized.

[0407] 1.3.2 Multiple Server Lists

[0408] In addition to the single server list, there are multiple server lists. Each multiple server list contains the entries that are being served by the same number of servers. Just like entries on the single server list, entries on the multiple server lists get promoted to the top of the next list when their frequency of access exceeds a certain threshold. Thus a file that is being heavily accessed might move from the single server list, to the dual server list and finally to the quad server list.

[0409] When an entry moves to a new list it is added to the top of that list. Periodically, a process will re-sort the list by frequency of access. As a file becomes less frequently accessed it will move toward the bottom of its list. Eventually the frequency of access will fall below a certain threshold and the entry will be placed on the top of the previous list, e.g. an entry might fall off the quad server list and be put on the dual server list. During this demotion process the number of servers serving this file will be reduced.

[0410] 1.4 Synchronizing Lists Across Multiple IXP's

[0411] The above scheme works well when one entity, i.e., an IXP, sees all the file READ requests. However, this will not be the case in most systems. In order to have the same set of servers serving a file, information must be passed between IXPs that have the same file entry. This information needs to be passed when an entry is promoted or demoted between lists, as this is when servers are added or taken away.

[0412] When an entry is going to be promoted by an IXP, it first broadcasts to all the other IXPs asking for their file handle entries for the file handle of the entry it wants to promote. When it receives the entries from the other IXPs, it looks to see whether one of the other IXPs has already promoted this entry. If one has, it adds the new servers from that entry. If not, it selects new servers based on criteria to be determined.

[0413] Demotion of an entry from one list to the other works much the same way, except that when the demoting IXP looks at the entries from the other IXPs, it looks for entries that have fewer servers than its entry currently does. If there are any, then it selects those servers. This keeps the same set of servers serving a file even as fewer of them are serving it. If there are no entries with fewer servers, then the IXP can use one or more criteria to remove the needed number of servers from the entry.

[0414] There are advantages to making load balancing decisions based upon file handle information. When the inode portion of the file handle is used to select a unique target NAS server for information reads, a maximally distributed cache is achieved. When an entire NAS working set of files fits in any one cache, then a lowest latency response system is created by allowing all working set files to be simultaneously inside every NAS server's cache. Load balancing is then best performed using a round-robin policy.

[0415] Pirus NAS servers will provide cache utilization feedback to an IXP load balancer. The LB can use this feedback to dynamically shift between maximally distributed caching and round-robin balancing for smaller working sets. These processes are depicted in FIGS. 25 and 26 (NFS Receive Micro-Code Flowchart and NFS Transmit Micro-Code Flowchart).

IV. Intelligent Forwarding and Filtering

[0416] The following discussion describes certain Pirus box functions referred to as intelligent forwarding and filtering (IFF). IFF is optimized to support the load balancing function described elsewhere herein. Hence, the following discussion contains various load balancing definitions that will facilitate an understanding of IFF.

[0417] As noted elsewhere herein, the Pirus box provides load-balancing functions in a manner that is transparent to the client and server. Therefore, the packets that traverse the box do not incur a hop count as they would, for example, when traversing a router. FIG. 28 is illustrative. In FIG. 28, Servers 1, 2, and 3 are directly connected to the Pirus box (denoted by the pear icon), and packets forwarded to them are sent to their respective MAC addresses. Server 4 sits behind a router and packets forwarded to it are sent to the MAC address of the router interface that connects to the Pirus box. Two upstream routers forward packets from the Internet to the Pirus box.

[0418] 1. Definitions

[0419] The following definitions are used in this discussion:

[0420] A Server Network Processor (SNP) provides the functionality for ports connected to servers. Packets received from a server are processed by an SNP.

[0421] A Router Network Processor (RNP) provides the functionality for ports connected to routers or similar devices. Packets received from a router are processed by an RNP.

[0422] In accordance with the invention, an NP may support the roles of RNP and SNP simultaneously. This is likely to be true, for example, on 10/100 Ethernet modules, as the NP will serve many ports, connected to both routers and servers.

[0423] An upstream router is the router that connects the Internet to the Pirus box.

[0424] 2. Virtual Domains

[0425] As used herein, the term "virtual domain" denotes a portion of a domain that is served by the Pirus box. It is "virtual" because the entire domain may be distributed throughout the Internet and a global load-balancing scheme can be used to "tie it all together" into a single domain.

[0426] In one practice of the invention, defining a virtual domain on a Pirus box requires specifying one or more URLs, such as www.fred.com, and one or more virtual IP addresses that are used by clients to address the domain. In addition, a list of the IP addresses of the physical servers that provide the content for the domain must be specified; the Pirus box will load-balance across these servers. Each physical server definition will include, among other things, the IP address of the server and, optionally, a protocol and port number (used for TCP/UDP port multiplexing—see below).

[0427] For servers that are not directly connected to the Pirus box, a route, most likely static, will need to be present; this route will contain either the IP address or IP subnet of the server that is NOT directly connected, with a gateway that is the IP address of the router interface that connects to the Pirus box, to be used as the next-hop to the server.

[0428] The IP subnet/mask pairs of the devices that make up the virtual domain should be configured. These subnet/mask pairs indirectly create a route table for the virtual domain. This allows the Pirus box to forward packets within a virtual domain, such as from content servers to application or database servers. A mask of 255.255.255.255 can be used to add a static host route to a particular device.

[0429] The Pirus box may be assigned an IP address from this subnet/mask pair. This IP address will be used in all IP and ARP packets authored by the Pirus box and sent to devices in the virtual domain. If an IP address is not assigned, all IP and ARP packets will contain a source IP address equal to one of the virtual IP addresses of the domain. FIG. 29 is illustrative. In FIG. 29, the Pirus box is designated by numeral 100, and the syntax for a port is <slot number>.<port number>. Ports 1.3, 2.3, 3.3, 4.3, 5.1 and 5.3 are part of the same virtual domain. Server 1.1.1.1 may need to send packets to Cache 1.1.1.100. Even though the Cache may not be explicitly configured as part of the virtual domain, configuring the virtual domain with an IP subnet/mask of 1.1.1.0/255.255.255.0 will allow the servers to communicate with the cache. Server 1.1.1.1 may also need to send packets to Cache 192.168.1.100. Since this IP subnet is outside the scope of the virtual domain (i.e., the cache, and therefore the IP address, may be owned by the ISP), a static host route can be added to this one particular device.

[0430] 2.1 Network Address Translation

[0431] In one practice of the invention, Network Address Translation, or NAT, is performed on packets sent to or from a virtual IP address. In FIG. 29 above, a client connected to the Internet will send a packet to a virtual IP address representing a virtual domain. The load-balancing function will select a physical server to send the packet to. NAT results in the destination IP address (and possibly the destination TCP/UDP port, if port multiplexing is being used) being changed to that of the physical server. The response packet from the server also has NAT performed on it to change the source IP address (and possibly the source TCP/UDP port) to that of the virtual domain.
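A minimal C sketch of the two NAT rewrites follows; the header structure is simplified to the translated fields, and the example addresses echo those used elsewhere in this document.

    /* Sketch of virtual-IP NAT: inbound rewrites the destination to the
       chosen server, outbound rewrites the source back to the virtual IP. */
    #include <stdint.h>
    #include <stdio.h>

    struct hdr { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

    void nat_in(struct hdr *h, uint32_t srv_ip, uint16_t srv_port)
    {
        h->dst_ip = srv_ip;             /* virtual IP -> physical server */
        if (srv_port) h->dst_port = srv_port;   /* port multiplexing, if used */
    }

    void nat_out(struct hdr *h, uint32_t vip, uint16_t vport)
    {
        h->src_ip = vip;                /* physical server -> virtual IP */
        if (vport) h->src_port = vport;
    }

    int main(void)
    {
        struct hdr h = { 0x01020304, 0x64010101 /* 100.1.1.1 */, 40000, 80 };
        nat_in(&h, 0x01010104 /* 1.1.1.4 */, 8001);
        printf("to server: %08x:%u\n", (unsigned)h.dst_ip, (unsigned)h.dst_port);
        nat_out(&h, 0x64010101, 80);
        return 0;
    }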

[0432] NAT is also performed when a load-balanceable server sends a request that also passes through the load-balancing function, such as an NFS request. In this case, the server assumes the role of a client.

[0433] 3. VLAN Definition

[0434] It is contemplated that, since the Pirus box will have many physical ports, the Virtual LAN (VLAN) concept will be supported. Ports that connect to servers and upstream routers will be grouped into their own VLAN, and the VLAN will be added to the configuration of a virtual domain.

[0435] In one practice of the invention, a virtual domain will be configured with exactly one VLAN. Although the server farms comprising the virtual domain may belong to multiple subnets, the Pirus box will not be routing (in a traditional sense) between the subnets, but will be performing a form of L3 switching. Unlike today's L3 switch/routers that switch frames within a VLAN at Layer 2 and route packets between VLANs at Layer 3, the Pirus box will switch packets using a combination of Layer 2 and Layer 3 information. It is expected that the complexity of routing between multiple VLANs will be avoided.

[0436] By default, packets received on all ports in the VLAN of a virtual domain are candidates for load balancing. On Router ports (see 3.4.1, Router Port), these packets are usually HTTP or FTP requests. On Server ports (see 3.4.2, Server Port), these packets are usually back-end server requests, such as NFS.

[0437] All packets received by the Pirus box are classified to a VLAN and are, hence, associated with a virtual domain. In some cases, this classification may be ambiguous because, with certain constraints, a physical port may belong to more than one VLAN. These constraints are discussed below.

[0438] 3.1 Default VLAN

[0439] In one practice of the invention, by default, every port will be assigned to the Default VLAN. All non-IP packets received by the Pirus box are classified to the Default VLAN. If a port is removed from the Default VLAN, non-IP packets received on that port are discarded, and non-IP packets received on other ports will not be sent on that port. In accordance with this practice of the invention, all non-IP packets will be handled in the slow path: the packets will be forwarded to a single CPU determined by an election process, and this CPU will need to build and maintain MAC address tables to avoid flooding all received packets on the Default VLAN. This avoids having to copy (potentially large) forwarding tables between slots but may result in each packet traversing the switch fabric twice.

[0440] 3.2 Server Administration VLAN

[0441] Devices connected to ports on the Server Administration VLAN can manage the physical servers in any virtual domain. By providing only this form of inter-VLAN routing, the system can avoid having to add Server Administration ports (see below) to the VLANs of every virtual domain that the server administration stations will manage.

[0442] 3.3 Server Access VLAN

[0443] A Server Access VLAN is used internally between Pirus boxes. A Pirus box can make a load-balancing decision to send a packet to a physical server that is connected to another Pirus box. The packet will be sent on a Server Access VLAN that, unlike packets received on Router ports, may directly address physical servers. See the discussion of Load Balancing elsewhere herein for additional information on how this is used.

[0444] 3.4 Port Types

[0445] 3.4.1 Router Port

[0446] In one embodiment of the invention, one or more Router ports will be added to the VLAN configuration of a virtual domain. Note that a Router port is likely to be carrying traffic for many virtual domains.

[0447] Classifying a packet received on a Router port to a VLAN of a virtual domain is done by matching the destination IP address to one of the virtual IP addresses of the configured virtual domains.

[0448] ARP requests sent by the Pirus box to determine the MAC address and physical port of the servers that are configured as part of a virtual domain are not sent out Router ports. If a server is connected to the same port as an upstream router, the port must be configured as a Combo port (see below).

[0449] 3.4.2 Server Port

[0450] Server ports connect to the servers that provide the content for a virtual domain. A Server port will most likely be connected to a single server, although it may be connected to multiple servers.

[0451] Classifying a packet received on a Server port to a VLAN of a virtual domain may require a number of steps:

[0452] 1. using the VLAN of the port, if the port is part of a single VLAN

[0453] 2. matching the destination IP address and TCP/UDP port number to the source of a flow (i.e., an HTTP response)

[0454] 3. matching the destination IP address to one of the virtual IP addresses of the configured virtual domains (i.e., an NFS request)

[0455] The default and preferred configuration is for a Server port to be a member of a single VLAN. However, multiple servers, physical or logical, may be connected to the same port and be in different VLANs only if the packets received on that port can unambiguously be associated with one of the VLANs on that port.

[0456] One way to achieve this is to use different IP subnets for all devices on the VLANs that the port connects to. TCP/UDP port multiplexing is often configured with a single IP address on a server and multiple TCP/UDP ports, one per virtual domain. It is preferable to also use a different IP address with each TCP/UDP port, but this is necessary only if the single server needs to send packets with TCP/UDP ports other than the ones configured on the Pirus box.

[0457] In FIG. 30, the physical server with IP address 1.1.1.4 provides HTTP content for two virtual domains, www.larry.com and www.curly.com. TCP/UDP port multiplexing is used to allow the same server to provide content for both virtual domains. When the Pirus box load balances packets to this server, it will use NAT to translate the destination IP address to 1.1.1.4 and the TCP port to 8001 for packets sent to www.larry.com and 8002 for packets sent to www.curly.com.

[0458] Packets sent from this server with a source TCP port of 8001 or 8002 can be classified to the appropriate domain. But if the server needs to send packets with other source ports (i.e., if it needs to perform an NFS request), it is ambiguous which domain the packet should be mapped to.

[0459] The list of physical servers that make up a domain may require significant configuration. The IP addresses of each must be entered as part of the domain. To minimize the amount of information that the administrator must provide, the Pirus box determines the physical port that connects to a server, as well as its MAC address, by issuing ARP requests to the IP addresses of the servers. The initial ARP requests are only sent out Server and Combo ports. The management software may allow the administrator to specify the physical port to which a server is attached. This restricts the ARP request used to obtain the MAC address to that port only.

[0460] A Server port may be connected to a router that sits between the Pirus box and a server farm. In this configuration, the VLAN of the virtual domain must be configured with a static route for the subnet of the server farm that points to the IP address of the router port connected to the Pirus box. This intermediate router needs a route back to the Pirus box as well (either a default route or a route to the virtual IP address(es) of the virtual domain(s) served by the server farm).

[0461] 3.4.3 Combo Port

[0462] A Combo port, as defined herein, is connected to both upstream routers and servers. Packet VLAN classification first follows the rules for Router ports, then those for Server ports.

[0463] 3.4.4 Server Administration Port

[0464] A Server Administration port is connected to nodes that administer servers. Unlike packets received on a Router port, packets received on a Server Administration port can be sent directly to servers. Packets can also be sent to virtual IP addresses in order to test the load-balancing function.

[0465] A Server Administration port may be assigned to a VLAN that is associated with a virtual domain, or it may be assigned to the Server Administration VLAN. The former is straightforward—the packets are forwarded only to servers that are part of the virtual domain. The latter case is more complicated, as the packets received on the Server Administration port can only be sent to a particular server if that server's IP address is unique among all server IP addresses known to the Pirus box. This uniqueness requirement also applies if the same server is in two different virtual domains with TCP/UDP port multiplexing.

[0466] 3.4.5 Server Access Port

[0467] A Server Access port is similar to a trunk port on a conventional Layer 2 switch. It is used to connect to another Pirus box and carry "tagged" traffic for multiple VLANs. This allows one Pirus box to forward a packet to a server connected to another Pirus box.

[0468] The Pirus box will use the IEEE 802.1Q VLAN trunking format. A VLAN ID will be assigned to the VLAN that is associated with the virtual domain. This VLAN ID will be carried in the VLAN tag field of the 802.1Q header.

[0469] 3.4.6 Example of VLAN

[0470] FIG. 30 is illustrative of a VLAN. Referring now to FIG. 30, the Pirus box, designated by the pear icon, is shown with 5 slots, each of which has 3 ports. The VLAN configuration is as follows (the syntax for a port is <slot number>.<port number>):

[0471] VLAN 1

[0472] Server ports 1.1, 2.1, 3.1 and 4.3 (denoted in picture by a dotted line)

[0473] Router port 4.1 (denoted in picture by a heavy solid line)

[0474] VLAN 2

[0475] Server ports 1.2, 2.2, 3.2 and 4.3 (denoted in picture by a dashed line)

[0476] Server Administration port 5.2

[0477] Router port 4.1 (denoted in picture by a heavy solid line)

[0478] VLAN 3

[0479] Server ports 1.3, 2.3, 3.3 and 4.3 (denoted in picture by a solid line)

[0480] Server Administration port 5.3

[0481] Router port 4.1 (denoted in picture by a heavy solid line)

[0482] Server Administration VLAN

[0483] Server Administration port 5.1 (denoted in picture by a wide area link)

[0484] An exemplary virtual domain configuration is as follows:

[0485] Virtual domain www.moe.com

[0486] Virtual IP address 100.1.1.1

[0487] VLAN 1

[0488] Server 2.1.1.1

[0489] Server 2.1.1.2

[0490] Server 2.1.1.3

[0491] Server 2.1.1.4

[0492] Virtual domain www.larry.com

[0493] Virtual IP address 200.1.1.1

[0494] VLAN 2

[0495] Server 1.1.1.1

[0496] Server 1.1.1.2

[0497] Server 1.1.1.3

[0498] Server 1.1.1.4 Port 8001

[0499] Virtual domain www.curly.com

[0500] Virtual IP address 300.1.1.1

[0501] VLAN 3

[0502] Server 1.1.1.1

[0503] Server 1.1.1.2

[0504] Server 1.1.1.3

[0505] Server 1.1.1.4 Port 8002

[0506] Domains www.larry.com and www.curly.com each have a VLAN containing 3 servers with the same IP addresses: 1.1.1.1, 1.1.1.2 and 1.1.1.3. This functionality allows different customers to have virtual domains with servers using their own private address space, which doesn't need to be unique among all the servers known to the Pirus box. They also contain the same server with IP address 1.1.1.4. Note the Port number in the configuration. This is an example of TCP/UDP port multiplexing, where different domains can use the same server, each using a unique port number. Domain www.moe.com has servers in their own address space, although server 2.1.1.4 is connected to the same port (4.3) as server 1.1.1.4 shared by the other two domains.

[0507] The administration station connected to port 5.2 is used to administer the servers in the www.larry.com virtual domain, and the station connected to 5.3 is used to administer the servers in the www.curly.com domain. The administration station connected to port 5.1 can administer the servers in www.moe.com.

[0508] 4. Filtering Function

[0509] The filtering function of an RNP performs filtering on packets received from an upstream router. This ensures that the physical servers downstream from the Pirus box are not accessed directly from clients connected to the Internet.

[0510] 5. Forwarding Function

[0511] The Pirus box will track flows between IP devices. A flow is a bi-directional conversation between two connected IP devices; it is identified by a source IP address, source UDP/TCP port, destination IP address, and destination TCP/UDP port.

[0512] A single flow table will contain flow entries for each flow through the Pirus box. The forwarding entry content, creation, removal and use are discussed below.

[0513] 5.1 Flow Entry Description

[0514] A flow entry describes a flow and the information necessary to reach the endpoints of the flow. A flow entry contains the following information:

    Attribute                         # of bytes  Description
    Source IP address                 4           Source IP address
    Destination IP address            4           Destination IP address
    Source TCP/UDP port               2           Source higher layer port
    Destination TCP/UDP port          2           Destination higher layer port
    Source physical port              2           Physical port of the source
    Source next-hop MAC address       6           The MAC address of next-hop to source
    Destination physical port         2           Physical port of the destination
    Destination next-hop MAC address  6           MAC address of next-hop to destination
    NAT IP address                    4           Translation IP address
    NAT TCP/UDP port                  2           Translation higher layer port
    Flags                             2           Various flags
    Received packets                  2           No. of packets received from source IP address
    Transmitted packets               2           No. of packets sent to the source IP address
    Received bytes                    4           No. of bytes received from source IP address
    Transmitted bytes                 4           No. of bytes sent to source IP address
    Next pointer (receive path)       4           Pointer to next forwarding entry in the hash table used in the receive path
    Next pointer (transmit path)      4           Pointer to next forwarding entry in the hash table used in the transmit path
    Transmit path key                 4           Smaller key unique among all flow entries
    Total                             60
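The table translates directly into a C structure, sketched below with assumed field names (the specification gives only the table). The next pointers are kept as 32-bit values to match the stated 4-byte sizes.

    /* Sketch of the 60-byte flow entry from the table above. */
    #include <stdint.h>
    #include <stdio.h>

    struct flow_entry {
        uint32_t src_ip;               /* 4: source IP address */
        uint32_t dst_ip;               /* 4: destination IP address */
        uint16_t src_port;             /* 2: source TCP/UDP port */
        uint16_t dst_port;             /* 2: destination TCP/UDP port */
        uint16_t src_phys_port;        /* 2: physical port of the source */
        uint8_t  src_next_hop_mac[6];  /* 6: next-hop MAC toward the source */
        uint16_t dst_phys_port;        /* 2: physical port of the destination */
        uint8_t  dst_next_hop_mac[6];  /* 6: next-hop MAC toward the destination */
        uint32_t nat_ip;               /* 4: translation IP address */
        uint16_t nat_port;             /* 2: translation TCP/UDP port */
        uint16_t flags;                /* 2: various flags */
        uint16_t rx_pkts;              /* 2: packets received from source */
        uint16_t tx_pkts;              /* 2: packets sent to source */
        uint32_t rx_bytes;             /* 4: bytes received from source */
        uint32_t tx_bytes;             /* 4: bytes sent to source */
        uint32_t next_rx;              /* 4: next entry, receive-path hash chain */
        uint32_t next_tx;              /* 4: next entry, transmit-path hash chain */
        uint32_t tx_key;               /* 4: smaller unique transmit-path key */
    };

    int main(void)
    {
        /* 60 bytes on a typical 32-bit-aligned target, matching the table's total. */
        printf("sizeof(struct flow_entry) = %zu\n", sizeof(struct flow_entry));
        return 0;
    }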

[0515] In accordance with the invention, the IP addresses and TCP/UDP ports in a flow entry are relative to the direction of the flow. Therefore, a flow entry for a flow will be different in the flow tables that handle each direction. This means a flow will have 2 different flow entries, one on the NP that connects to the source of the flow and one on the NP that connects to the destination of the flow. If the same NP connects to both the source and destination, then that NP will contain 2 flow entries for the flow.

[0516] In one practice of the invention, on an RNP, the first four attributes uniquely identify a flow entry. The source and destination IP addresses are globally unique in this context since they both represent reachable Internet addresses.

[0517] On an SNP, the fifth attribute is also required to uniquely identify a flow entry. This is best described in connection with the example shown in FIG. 31. As shown therein, a mega-proxy, such as AOL, performs NAT on the source IP address and TCP/UDP port combinations from the clients that connect to it. Since a flow is defined by source and destination IP address and TCP/UDP port, the proxy can theoretically reuse the same source IP address and TCP/UDP port when communicating with different destinations. But when the Pirus box performs load balancing and NAT from the virtual IP address to a particular server, the destination IP addresses and TCP/UDP port of the packets may no longer be unique to a particular flow. Therefore, the virtual domain must be included in the comparison to find the flow entry. Requiring that the IP addresses reachable on a Server port be unique across all virtual domains on that port solves the problem. The flow entry lookup can also compare the source physical port of the flow entry with the physical port on which the packet was received.

[0518] A description of the attributes is as follows:

[0519] 5.1.1 Source IP address: The source IP address of the packet. Source TCP/UDP port: The source TCP/UDP port number of the packet.

[0520] 5.1.2 Destination IP address: The destination IP address of the packet.

[0521] 5.1.3 Destination TCP/UDP port: The destination TCP/UDP port number of the packet.

[0522] 5.1.4 Source physical port: The physical port on the Pirus box used to reach the source IP address.

[0523] 5.1.5 Source next-hop MAC address: The MAC address of the next-hop to the source IP address. This MAC address is reachable out the source physical port and may be the host that owns the IP address.

[0524] 5.1.6 Destination physical port: The physical port on the Pirus box used to reach the destination IP address.

[0525] 5.1.7 Destination next-hop MAC address: The MAC address of the next-hop to the destination IP address. This MAC address is reachable out the destination physical port and may be the host that owns the IP address.

[0526] 5.1.8 NAT IP address: The IP address that either the source or destination IP address must be translated to. If the source IP address in the flow entry represents the source of the flow, then this address replaces the destination IP address in the packet. If the source IP address in the flow entry represents the destination of the flow, then this address replaces the source IP address in the packet.

[0527] 5.1.9 NAT TCP/UDP port: The TCP/UDP port that either the source or destination TCP/UDP port must be translated to. If the source TCP/UDP port in the flow entry represents the source of the flow, then this port replaces the destination TCP/UDP port in the packet. If the source TCP/UDP port in the flow entry represents the destination of the flow, then this port replaces the source TCP/UDP port in the packet.

[0528] 5.1.10 Flags: Various flags can be used to denote whether the flow entry is relative to the source or destination of the flow, etc.

[0529] 5.1.11 Received packets: The number of packets received with a source IP address and TCP/UDP port equal to that in the flow entry.

[0530] 5.1.12 Transmitted packets: The number of packets transmitted with a destination IP address and TCP/UDP port equal to that in the flow entry.

[0531] 5.1.13 Received bytes: The number of bytes received with a source IP address and TCP/UDP port equal to that in the flow entry.

[0532] 5.1.14 Transmitted bytes: The number of bytes transmitted with a destination IP address and TCP/UDP port equal to that in the flow entry.

[0533] 5.1.15 Next pointer (receive path): A pointer to the next flow entry in the linked list. It is assumed that a hash table will be used to store the flow entries. This pointer will be used to traverse the list of hash collisions in the hash done by the receive path (see below).

[0534] 5.1.16 Next pointer (transmit path): A pointer to the next flow entry in the linked list. It is assumed that a hash table will be used to store the flow entries. This pointer will be used to traverse the list of hash collisions in the hash done by the transmit path (see below).

[0535] 5.2 Adding Forwarding Entries

[0536] 5.2.1 Client IP Addresses:

[0537] A client IP address is identified as a source IP address in a packet that has a destination IP address that is part of a virtual domain. A flow entry is created for client IP addresses by the load-balancing function. A packet received on a Router or Server port is matched against the configured policies of a virtual domain. If a physical server is chosen to receive the packet, a flow entry is created with the following values:

    Attribute                         Value
    Source IP address                 the source IP address from the packet
    Destination IP address            the destination IP address from the packet
    Source TCP/UDP port               the source TCP/UDP port from the packet
    Destination TCP/UDP port          the destination TCP/UDP port from the packet
    Source physical port              the physical port on which the packet was received
    Source next-hop MAC address       the source MAC address of the packet
    Destination physical port         the physical port connected to the server
    Destination next-hop MAC address  the MAC address of the server
    NAT IP address                    the IP address of the server chosen by the load-balancing function
    NAT TCP/UDP port                  the TCP/UDP port number of the chosen server

[0538] The NAT TCP/UDP port may be different from the destination TCP/UDP port if port multiplexing is used.

[0539] Flags: to be determined.

[0540] In one practice of the invention, the flow entry will be added to two hash tables. One hash table is used to look up a flow entry given values in a packet received via a network interface. The other hash table is used to look up a flow entry given values in a packet received via the switch fabric. Both hash table index values will most likely be based on the source and destination IP addresses and TCP/UDP port numbers.
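
As a concrete illustration of such an index, the sketch below hashes the 4-tuple into a bucket number. The table size and the mixing function are assumptions; the text commits only to the inputs of the hash.

    #include <stdint.h>

    #define FLOW_HASH_BUCKETS 4096   /* assumed size; must be a power of two */

    /* One plausible hash over the source/destination addresses and ports. */
    static inline uint32_t flow_hash(uint32_t src_ip, uint16_t src_port,
                                     uint32_t dst_ip, uint16_t dst_port)
    {
        uint32_t h = src_ip ^ dst_ip ^ ((uint32_t)src_port << 16) ^ dst_port;
        h ^= h >> 16;                 /* fold the high bits down */
        h *= 0x45d9f3b;               /* cheap integer mixing    */
        h ^= h >> 16;
        return h & (FLOW_HASH_BUCKETS - 1);
    }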

[0541] In accordance with the invention, if the packet of the new flow is received on a Router port, then the newly created forwarding entry needs to be sent to the NPs of all other Router ports. The NP connected to the flow destination (most likely a Server port; could it be a Router port?) will rewrite the flow entry from the perspective of packets received on that port that will be sent to the source of the flow:

    Attribute                         Value
    Source IP address                 original NAT IP address
    Destination IP address            original source IP address
    Source TCP/UDP port               original NAT TCP/UDP port
    Destination TCP/UDP port          original source TCP/UDP port
    Source physical port              original destination physical port
    Source next-hop MAC address       original destination MAC address
    Destination physical port         original source physical port
    Destination next-hop MAC address  original source MAC address
    NAT IP address                    original destination IP address
    NAT TCP/UDP port                  original destination TCP/UDP port
    Flags                             to be determined
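
The rewrite amounts to swapping the two directions of the entry. A minimal sketch, reusing the hypothetical flow_entry_t from the earlier sketch:

    #include <string.h>

    /* Build the reverse-direction entry described in [0541] from the
     * forwarding entry created at the ingress port. */
    static void build_reverse_entry(const flow_entry_t *fwd, flow_entry_t *rev)
    {
        memset(rev, 0, sizeof(*rev));
        rev->src_ip        = fwd->nat_ip;        /* original NAT IP address   */
        rev->dst_ip        = fwd->src_ip;        /* original source IP        */
        rev->src_port      = fwd->nat_port;      /* original NAT TCP/UDP port */
        rev->dst_port      = fwd->src_port;      /* original source port      */
        rev->src_phys_port = fwd->dst_phys_port; /* swap physical ports       */
        rev->dst_phys_port = fwd->src_phys_port;
        memcpy(rev->src_next_hop_mac, fwd->dst_next_hop_mac, 6); /* swap MACs */
        memcpy(rev->dst_next_hop_mac, fwd->src_next_hop_mac, 6);
        rev->nat_ip   = fwd->dst_ip;             /* original destination IP   */
        rev->nat_port = fwd->dst_port;           /* original destination port */
    }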

[0542] 5.2.2 Virtual Domain IP Addresses:

[0543] Virtual domain IP addresses are those that identify the domain (such as www.fred.com) and are visible to the Internet. The "next hop" of these IP addresses is the load balancing function. In one practice of the invention, addition of these IP addresses is performed by the management software when the configuration is read.

    Attribute                 Value
    IP address                the virtual IP address
    TCP/UDP port              zero if the servers in the virtual domain accept all TCP/UDP port numbers; otherwise, a separate forwarding entry will exist for each TCP/UDP port number that is supported
    Destination IP address    zero
    Destination TCP/UDP port  zero
    Physical port             n/a
    Next-hop MAC address      n/a
    Server IP address         n/a
    Server TCP/UDP port       n/a
    Server physical port      n/a
    Flags                     an indicator that packets destined to this IP address and TCP/UDP port are to be load-balanced

[0544] 5.2.3 Server IP Addresses:

[0545] Server IP addresses are added to the forwarding table by the management software when the configuration is read.

[0546] The forwarding function will periodically issue ARP requests for the IP address of each physical server. Exactly how the physical servers are known, whether by manual configuration or dynamic learning, is beyond the scope of the IFF function. In any case, since the administrator shouldn't have to specify the port that connects to the physical servers, the Pirus box must determine it. ARP requests will need to be sent out every port connected to an SNP until an ARP response is received from a server on a port. Once a server's IP address has been resolved, periodic ARP requests to ensure the server is still alive can be sent out the learned port. A forwarding entry will be created once an ARP response is received. A forwarding entry will be removed (or marked invalid) once it times out.

[0547] If the ARP information for the server times out, subsequent ARP requests will again need to be sent out all SNP ports. An exponential backoff time can be used so that servers that are turned off will not result in significant bandwidth usage.
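
One plausible backoff schedule is sketched below. The base interval and the cap are assumptions, chosen only to illustrate the doubling behavior the text calls for.

    #include <stdint.h>

    #define ARP_BASE_INTERVAL_SECS 4
    #define ARP_MAX_INTERVAL_SECS  512

    /* Return the next re-ARP interval after a failed attempt. */
    static uint32_t next_arp_interval(uint32_t current_secs)
    {
        if (current_secs == 0)
            return ARP_BASE_INTERVAL_SECS;
        if (current_secs >= ARP_MAX_INTERVAL_SECS / 2)
            return ARP_MAX_INTERVAL_SECS; /* clamp so bandwidth use stays bounded */
        return current_secs * 2;          /* double the wait after each failure   */
    }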

[0548] For servers connected to the Pirus box via a router, ARP requests will be issued for the IP address of the router interface.

    Attribute                 Value
    IP address                the server's IP address
    TCP/UDP port              TBD
    Destination IP address    zero
    Destination TCP/UDP port  zero
    Physical port             n/a
    Server IP address         n/a
    Server TCP/UDP port       n/a
    Server physical port      n/a
    Flags                     TBD

[0549] 5.3 Distributing the Forwarding Table:

[0550] In one practice of the invention, as physical servers are located, their IP address/port combinations will be distributed to all RNPs. Likewise, as upstream routers are located, their IP address/MAC address/port combinations will be distributed to all SNPs.

[0551] 5.4 Ingress Function:

[0552] It is assumed that the Ethernet frame passes the CRC check before the packet reaches the forwarding function and that frames that don't pass the CRC check are discarded. As it is anticipated that the RNP will be heavily loaded, the IP and TCP/UDP checksum validation can be performed by the SNP. Although it is probably not useful to perform the forwarding function if the packet is corrupted, because the data used by those functions may be invalid, the process should still work.

[0553] After the load balancing function has determined a physical server that should receive the packet, the forwarding function performs a lookup on the IP address of the server. If an entry is found, this forwarding table entry contains the port number that is connected to the server, and the packet is forwarded to that port. If no entry is found, the packet is discarded. The load balancing function should never choose a physical server whose location is unknown to the Pirus box.

[0554] When a packet is received from a server, the forwarding function performs a lookup on the IP address of the upstream router. If an entry is found, the packet is forwarded to the port contained in the forwarding entry.

[0555] The ingress function in the RNP calls the load balancing function and is returned the following (any value of zero implies that the old value should be used):

[0556] 1. new destination IP address

[0557] 2. new destination port

[0558] The RNP will optionally perform Network Address Translation, or NAT, on the packets that arrive from the upstream router. This is because the packets from the client have a destination IP address of the domain (i.e., www.fred.com). The new destination IP address of the packet is that of the actual server that was chosen by the load balancing function. In addition, a new destination port may be chosen if TCP/UDP port multiplexing is in use. Port multiplexing may be used on the physical servers in order to conserve IP addresses. A single server may serve multiple domains, each with a different TCP/UDP port number.

[0559] The SNP will optionally perform NAT on the packets that arrive from a server. This is because there may be a desire to hide the details of the physical servers that provide the load balancing function and have it appear as if the domain IP address is the "server". The new source of the packet is that of the domain. As the domain may have multiple IP addresses, the Pirus box needs a client table that maps the client's IP address and TCP/UDP port to the domain IP address and port to which the client sent the original packet.

[0560] 6. Egress Function:

[0561] Packets received from an upstream router will be forwarded to a server. The forwarding function sends the packet to the SNP providing support for the server. This SNP performs the egress function to do the following:

[0562] 1. verify the IP checksum

[0563] 2. verify the TCP or UDP checksum

[0564] 3. change the destination port to that of the server (as determined by the load balancing function call in the ingress function)

[0565] 4. change the destination IP address to that of the server (as determined by the load balancing function call in the ingress function)

[0566] 5. recalculate the TCP or UDP checksum if the destination port or destination IP address was changed

[0567] 6. recalculate the IP header checksum if the destination IP address was changed

[0568] 7. set the destination MAC address to that of the server or next-hop to the server (as determined by the forwarding function)

[0569] 8. recalculate the Ethernet packet CRC if the destination port or destination IP address was changed

[0570] Packets received from a server will be forwarded to an upstream router. The SNP performs the egress function to do the following (a sketch of the checksum recalculation follows this list):

[0571] 1. verify the IP checksum

[0572] 2. verify the TCP or UDP checksum

[0573] 3. change the source port to the one that the client sent the request to (as determined by the ingress function client table lookup)

[0574] 4. change the source IP address to the one that the client sent the request to (as determined by the ingress function client table lookup)

[0575] 5. recalculate the TCP or UDP checksum if the source port or source IP address was changed

[0576] 6. recalculate the IP header checksum if the source IP address was changed

[0577] 7. set the destination MAC address to that of the upstream router

[0578] 8. recalculate the Ethernet packet CRC if the source port or source IP address was changed
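
The checksum steps in both egress lists follow the standard Internet checksum procedure (RFC 1071). Below is a minimal sketch of the server-bound rewrite (steps 4-6 of the first list), assuming a 20-byte IPv4 header with no options; the TCP/UDP pseudo-header checksum update is omitted for brevity, and the helper names are assumptions, not taken from the text.

    #include <stdint.h>
    #include <stddef.h>

    /* 16-bit ones'-complement checksum over a byte buffer. */
    static uint16_t cksum16(const uint8_t *p, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {
            sum += (uint32_t)p[0] << 8 | p[1];
            p += 2;
            len -= 2;
        }
        if (len)
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);  /* fold carries */
        return (uint16_t)~sum;
    }

    /* Rewrite the destination address (bytes 16-19 of a 20-byte IPv4
     * header) and recompute the header checksum (bytes 10-11). Working
     * on raw bytes sidesteps endianness concerns. server_ip is a
     * host-order IPv4 address, e.g. 0xC0A80001 for 192.168.0.1. */
    static void rewrite_to_server(uint8_t hdr[20], uint32_t server_ip)
    {
        hdr[16] = (uint8_t)(server_ip >> 24);
        hdr[17] = (uint8_t)(server_ip >> 16);
        hdr[18] = (uint8_t)(server_ip >> 8);
        hdr[19] = (uint8_t)(server_ip);
        hdr[10] = hdr[11] = 0;            /* zero the checksum field   */
        uint16_t c = cksum16(hdr, 20);    /* recompute over the header */
        hdr[10] = (uint8_t)(c >> 8);
        hdr[11] = (uint8_t)c;
    }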

V. IP-Based Storage Management—Device Discovery & Monitoring

[0579] In data networks based on IP/Ethernet technology, a set of standards has developed that permits users to manage/operate their networks using a heterogeneous collection of hardware and software. These standards include Ethernet, Internet Protocol (IP), Internet Control Message Protocol (ICMP), Management Information Block (MIB) and Simple Network Management Protocol (SNMP). Network Management Systems (NMS) such as HP Open View utilize these standards to discover and monitor network devices.

[0580] Storage Area Networks (SANs) use a completely different set of technology based on Fibre Channel (FC) to build and manage "Storage Networks". This has led to a "re-inventing of the wheel" in many cases. Also, SAN devices do not integrate well with existing IP-based management systems.

[0581] Lastly, the storage devices (disks, RAID arrays, etc.), which are Fibre Channel attached to the SAN devices, do not support IP (and the SAN devices have limited IP support), so the storage devices cannot be discovered/managed by IP-based management systems. There are essentially two sets of management products: one for the IP devices and one for the storage devices.

[0582] A trend is developing where storage networks and IP networks are converging to a single network based on IP. However, conventional IP-based management systems cannot discover FC attached storage devices.

[0583] The following discussion explains a solution to this problem, in two parts. The first aspect is device discovery; the second is device monitoring.

[0584] Device Discovery

[0585] FIG. 32 illustrates device discovery in accordance with the invention. In the illustrated configuration the NMS cannot discover ("see") the disks attached to the FC Switch, but it can discover ("see") the disks attached to the Pirus System. This is because the Pirus System does the following:

[0586] Assigns an IP address to each disk attached to it.

[0587] Creates an Address Resolution Protocol (ARP) table entry for each disk. This is a simple table that contains a mapping between IP and physical addresses.

[0588] When the NMS uses SNMP to query the Pirus System, the Pirus System will return an ARP entry for each disk attached to it.

[0589] The NMS will then "ping" (send an ICMP echo request) for each ARP entry it receives from the Pirus System.

[0590] The Pirus System will intercept the ICMP echo requests destined for the disks, translate each ICMP echo into a SCSI Read Block 0 request, and send it to the disk.

[0591] If the SCSI Read Block 0 request completes successfully, then the Pirus System acknowledges the "ping" by sending back an ICMP echo reply to the NMS.

[0592] If the SCSI Read Block 0 request fails, then the Pirus System will not respond to the "ping" request.

[0593] The end result of these actions is that the NMS will learn about the existence of each disk attached to the Pirus System and verify that it can reach it. The NMS has now discovered the device.
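
The intercept logic reduces to a small translation routine, sketched hypothetically below. scsi_read_block0() and icmp_send_echo_reply() are stand-ins for platform services that the text does not name.

    #include <stdbool.h>
    #include <stdint.h>

    extern bool scsi_read_block0(uint32_t disk_ip);      /* issue READ, LBA 0 */
    extern void icmp_send_echo_reply(uint32_t nms_ip, uint32_t disk_ip,
                                     uint16_t icmp_id, uint16_t icmp_seq);

    /* Answer a ping aimed at a disk's assigned IP only if the disk
     * responds to a SCSI read of block 0. */
    void handle_ping_for_disk(uint32_t nms_ip, uint32_t disk_ip,
                              uint16_t icmp_id, uint16_t icmp_seq)
    {
        if (scsi_read_block0(disk_ip)) {
            /* Disk answered: acknowledge the ping on its behalf. */
            icmp_send_echo_reply(nms_ip, disk_ip, icmp_id, icmp_seq);
        }
        /* On failure, fall through silently; the NMS sees an unreachable disk. */
    }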

[0594] Device Monitoring

[0595] Once the device (disk) has been discovered by the NMS, it will start sending it SNMP requests to learn what the device can do (i.e., determine its level of functionality). The Pirus System will intercept these SNMP requests and generate a SCSI request to the device. The response to the SCSI request will be converted back into an SNMP reply and returned to the NMS. FIG. 33 illustrates this.

[0596] The configuration illustrated in FIG. 33 is essentially an SNMP <-> SCSI converter/translator.

[0597] Lastly, NMS can receive asynchronous events (traps) from devices. These are notifications of events that may or may not need attention. The Pirus System will also translate SCSI exceptions into SNMP traps, which are then propagated to the NMS. FIG. 34 illustrates this.

VI. DATA STRUCTURE LAYOUT

[0598] Data Structure Layout: FIG. 35 shows the relationships between the various configuration data structures. Each data structure is described in detail following the diagram. The data structures are not linked; however, the interconnecting lines in the diagram display references from one data structure to another. These references are via instance number.

[0599] Data Structure Descriptions:

[0600] 1. VSD_CFG_T: This data structure describes a Virtual Storage Domain. Typically there is a single VSD for each end user customer of the box. A VSD has references to VLANs that provide information on ports allowed access to the VSD. VSE structures provide information for the storage available to a VSD, and SERVER_CFG_T structures provide information on CPUs available to a VSD. A given VSD may have multiple VSE and SERVER structures.

[0601] 2. VSE_CFG_T: This data structure describes a Virtual Storage Endpoint. VSEs can be used to represent Virtual Servers (NAS) or IP-accessible storage (iSCSI, SCSI over UDP, etc.). They are always associated with one, and only one, VSD.

[0602] 3. VlanConfig: This data structure is used to associate a VLAN with a VSD. It is not used to create a VLAN.

[0603] 4. SERVER_CFG_T: This data structure provides information regarding a single CPU. It is used to attach CPUs to VSEs and VSDs. For replicated NFS servers there can be more than one of these data structures associated with a given VSE.

[0604] 5. MED_TARG_CFG_T: This data structure represents the endpoint for Mediation Target configuration: a device on the Fibre Channel connected to the Pirus box being accessed via some form of SCSI over IP.

[0605] 6. LUN_MAP_CFG_T: This data structure is used for mapping Mediation Initiator access. It maps a LUN on the specified Pirus FC port to an IP/LUN pair on a remote iSCSI target.

[0606] 7. FILESYS_CFG_T: This data structure is used to represent a file system on an individual server. There may be more than one of these associated with a given server. If this file system will be part of a replicated NFS file system, the filesystem_id and the mount point will be the same for each of the file systems in the replica set.

[0607] 8. SHARE_CFG_T: This data structure is used to provide information regarding how a particular file system is being shared. The information in this data structure is used to populate the sharetab file on the individual server CPUs.
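
For illustration, skeletal C declarations of a few of these records follow. Only the cross-references described above are modeled; every field name, array bound, and width is an assumption, and as [0598] notes, the records refer to one another by instance number rather than by pointer.

    #include <stdint.h>

    typedef struct {
        uint32_t vsd_instance;        /* this VSD's instance number     */
        uint32_t vlan_instances[4];   /* VlanConfig references          */
        uint32_t vse_instances[8];    /* VSE_CFG_T references           */
        uint32_t server_instances[8]; /* SERVER_CFG_T references        */
    } VSD_CFG_T;

    typedef struct {
        uint32_t vse_instance;
        uint32_t vsd_instance;        /* always exactly one owning VSD  */
    } VSE_CFG_T;

    typedef struct {
        uint32_t server_instance;     /* one CPU                        */
        uint32_t vse_instance;        /* VSE this CPU is attached to    */
    } SERVER_CFG_T;

    typedef struct {
        uint32_t filesys_instance;
        uint32_t server_instance;     /* owning server CPU              */
        uint32_t filesystem_id;       /* shared across a replica set    */
        char     mount_point[64];     /* same for each replica member   */
    } FILESYS_CFG_T;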

Examples—Server Health

[0608] 1) Listen for VSD_CFG_T. When one is received, create a local VSD structure.

[0609] 2) Listen for VSE_CFG_T. When one is received, wire it to the local VSD.

[0610] 3) Listen for SERVER_CFG_T. When one is received, wire it to the local VSE.

[0611] 4) Start Server Health for the server.

[0612] 5) Listen for FILESYS_CFG_T. When one is received, wire it to the local SERVER/VSE.

[0613] 6) Start Server Health read/write to the file system.

[0614] 7) Listen for MED_SE_CFG_T. When one is received, wire it to the local VSE.

[0615] 8) Start Server Health pings on the IP specified in the VSE referenced by MED_SE_CFG_T.

Mediation Target

[0616] 1) Listen for VSE_CFG_T. When one with type MED is received, create a local VSE structure.

[0617] 2) Listen for MED_SE_CFG_T. When one is received, wire it to the local VSE.

[0618] 3) Set up the mediation mapping based on the information provided in the VSE/MED_SE pair.

Mediation Initiator

[0619] 1) Listen for LUN_MAP_CFG_T. When one is received, request the associated SERVER_CFG_T from the MIC.

[0620] 2) Create local SERVER structure.

[0621] 3) Add information from LUN_MAP_CFG_T to LUN map for that server.

NCM

[0622] 1) Listen for SHARE_CFG_T with a type of NFS.

[0623] 2) Request the associated FILESYS_CFG_T from the MIC.

[0624] 3) If the filesystem_id already exists, add to the set. If it is new, create a new replica set.

[0625] 4) Bring the new file system up to date. When finished, send FILESYS_CFG_T with a state of "ONLINE".

[0626] The above features of the Pirus System allow storage devices attached to a Pirus System to be discovered and managed by an IP-based NMS. This lets users apply standards-based, widely deployed systems that manage IP data networks to the management of storage devices, something not previously possible.

[0627] Accordingly, the Pirus System permits the integration of non-IP storage devices (e.g., disks) into IP-based management systems (e.g., an NMS), and thus provides unique features and functionality.

VII. NAS Mirroring and Content Distribution

[0628] The following section describes techniques and subsystems for providing mirrored storage content to external NAS clients in accordance with the invention.

[0629] The Pirus SRC NAS subsystem described herein provides dynamically distributed, mirrored storage content to external NAS clients, as illustrated in FIG. 36. These features provide storage performance scalability and increased availability to users of the Pirus system. The following describes the design of the SRC NAS content distribution subsystem as it pertains to NAS servers and NAS management processes. Load Balancing operations are described elsewhere in this document.

[0630] 1. Content Distribution and Mirroring

[0631] 1.1 Mirror Initialization via NAS

[0632] After volume and filesystem initialization, a complete copy of a filesystem can be established using the normal NAS facilities (create and write) and the maintenance procedures described hereinafter. A current filesystem server set is in effect immediately after filesystem creation using this method.

[0633] 1.2 Mirror Initialization via NDMP

[0634] A complete filesystem copy can also be initialized via NDMP. Since NDMP is a TCP-based protocol and TCP-based load balancing is not initially supported, the 2nd and subsequent members of a NAS peer set must be explicitly initialized. This can be done with additional NDMP operations. It can also be accomplished by the filesystem synchronization facilities described herein. Once initialization is complete, a current filesystem server set is in effect.

[0635] 1.3 Sparse Content Distribution

[0636] Partial filesystem content replication can also be supported. Sparse copies of a filesystem will be dynamically maintained in response to IFF and MIC requests. The details of MIC and IXP interaction can be left to implementers, but the concept of sparse filesystems and their maintenance is discussed herein.

[0637] 2. NCM

[0638] The NCM (NAS Coherency Manager) is used to maintain file handle synchronization, manage content distribution, and coordinate filesystem (re)construction. The NCM runs primarily on an SRC's 9th processor, with agents executing on LIC IXPs and SRC 750s within the chassis. Inter-chassis NAS replication is beyond the scope of this document.

[0639] 2.1 NCM Objectives

[0640] One of the primary goals of the NCM is to minimize the impact of mirrored content service delivery upon individual NAS servers. NAS servers within the Pirus chassis will operate as independent peers while the NCM manages synchronization issues "behind the scenes."

[0641] The NCM will be aware of all members in a Configured Filesystem Server Set. Individual NAS servers do not have this responsibility.

[0642] The NCM will resynchronize NAS servers that have fallen out of sync with the Configured Filesystem Server Set, whether due to transient failure, hard failure, or new extension of an existing group.

[0643] The NCM will be responsible for executing content re-distribution requests made by IFF load balancers when sparse filesystem copies are supported. The NCM will provide Allocated Inode and Content Inode lists to IFF load balancers.

[0644] The NCM will be responsible for executing content re-distribution requests made by the MIC when sparse filesystem copies are supported. Note that rules should exist for run-time contradictions between IXP and MIC balancing requests.

[0645] The NCM will declare NAS server "life" to interested parties in the chassis and accept "death notices" from server-health-related services.

[0646] 2.2 NCM Architecture

[0647] 2.3 NCM Processes and Locations

[0648] The NCM has components executing at several places in the Pirus chassis.

[0649] The primary NCM service executes on an SRC 9th processor.

[0650] An NCM agent runs on each SRC 750 CPU that is loaded for NAS.

[0651] An NCM agent runs on each IXP that is participating in a VSD.

[0652] A Backup NCM process will run on a 2nd SRC's 9th processor. If the primary NCM becomes unavailable for any reason, the secondary NCM will assume its role.

[0653] 2.4 NCM and IPC Services

[0654] The NCM will use the Pirus IPC subsystem to communicate with IFF and NAS server processors.

[0655] The NCM will receive any and all server health declarations, as well as any IFF-initiated server death announcement. The NCM will announce server life to all interested parties via IPC.

[0656] Multicast IPC messages should be used by NCM agents when communicating with the NCM service. This allows the secondary NCM to remain synchronized and results in less disruptive failover transitions.

[0657] After chassis initialization the MIC configuration system will inform the NCM of all Configured Filesystem Server Sets via IPC. Any user-configured changes to Filesystem Server Sets will be relayed to the NCM via IPC.

[0658] NCM will make requests of NCM agents via IPC and accept their requests as well.

[0659] 2.5 NCM and Inode Management

[0660] All file handles (inodes) in a Current Filesystem Server Set should have identical interpretation.

[0661] The NCM will query each member of a Configured Filesystem Server Set for InodeList-Allocated and InodeList-Content after initialization and after synchronization. The NCM may periodically repeat this request for verification purposes.

[0662] Each NAS server is responsible for maintaining these 2 file handle usage maps on a per-filesystem basis. One map represents all allocated inodes on a server (IN-Alloc). The 2nd usage map represents all inodes with actual content present on the server (IN-Content). On servers where full n-way mirroring is enabled the 2 maps will be identical. On servers using content sensitive mirroring the 2nd "content" map will be a subset of the first. Usage maps will have a global filesystem checkpoint value associated with them.

[0663] 2.6 Inode Allocation Synchronization

[0664] All peer NAS servers must maintain identical file system and file handle allocations.

[0665] All inode creation and destruction operations must be multicast from the IXP/IFF source to an entire active filesystem server set. These multicast packets must also contain a sequence number that uniquely identifies the transaction on a per-IXP basis.

[0666] Inode creation and destruction will be serialized within individual NAS servers.

[0667] 2.7 Inode Inconsistency Identification

[0668] When an inode is allocated, deallocated or modified, the multicasting IXP must track the outstanding request, and report inconsistency or timeout as a NAS server failure to the NCM.

[0669] When all members of a current filesystem server set time out on a single request, the IXP must consider that the failure is one of the following events:

[0670] IXP switch fabric multicast transmission error

[0671] Bogus client request

[0672] Simultaneous current filesystem server set fatality

[0673] The 3rd item is least likely and should only be assumed when the first 2 bullets can be ruled out.

[0674] NAS servers must track the incoming multicast sequence number provided by the IXP in order to detect erroneous transactions as soon as possible. If a NAS server detects a missing or out-of-order multicast sequence number it must negotiate its own death with the NCM. If all members of a current filesystem server set detect the same missing sequence number then the negotiation fails and the current filesystem server set should remain active.

[0675] When an inconsistency is identified the offending NAS server will be reset and rebooted. The NCM is responsible for initiating this process. It may be possible to gather some "pre-mortem" information and possibly even undo final erroneous inode allocations prior to rebooting.

[0676] 3. Filesystem Server Sets

[0677] 3.1 Types

[0678] For a given filesystem, there are 3 filesystem server sets that pertain to it: configured, current and joining.

[0679] As described in the definition section, the configured filesystem server set is what the user specified as the CPUs intended to serve a copy of the particular filesystem. To make a filesystem ready for service, a current filesystem server set must be created. As servers present themselves and their copy of the filesystem to the NCM and are determined to be part of the configured server set, the NCM must reconcile their checkpoint value for the filesystem with either the current set's checkpoint value or, in the case where a current filesystem server set does not yet exist, the checkpoint value of joining servers.

[0680] A current filesystem server set is a dynamic grouping of servers that is identified by a filesystem id and a checkpoint value. The current filesystem server set for a filesystem is created and maintained by the NCM. The joining server set is simply the set of NAS servers that are attempting to become part of the current server set.

[0681] 3.2 States of the Current Server Set

[0682] A current filesystem server set can be active, inactive, or paused. When it is active, NFS requests associated with the filesystem id are being forwarded from the IXPs to the members of the set. When the set is inactive, the IXPs are dropping NFS requests to the server set. When the set is paused, the IXPs are queuing NFS requests destined for the set.

[0683] When a current filesystem server set is active and serving clients and a new server wishes to join the set, the set must at least be paused to prevent updates to the copies of the filesystem during the join operation. The benefit of a successful pause and continue versus deactivate and activate is that NFS clients may not need to retransmit requests that were sent while the new server was joining. There clearly are limits to how many NFS client requests can be queued before some must be dropped. Functionally both work. A first pass could leave out the pause and continue operations until later.

[0684] 4. Description of Operations on a Current Filesystem Server Set

[0685] During the lifetime of a current filesystem server set, several items of information must be kept, for recovery purposes, somewhere where an NCM can find them after a fault.

[0686] 4.1 Create_Current_Filesystem_Server_Set(fsid, slots/cpus)

[0687] Given a set of CPUs that are up, configured to serve the filesystem, and wishing to join, the NCM must decide which server has the latest copy of the filesystem, and then synchronize the other joining members with that copy.

[0688] 4.2 Add_Member_To_Current_Filesystem_Server_Set(fsid, slot/cpu)

[0689] Given a CPU that wishes to join, the NCM must synchronize that CPU's copy of the filesystem with the copy being used by the current filesystem server set.

[0690] 4.3 Checkpoint_Current_Filesystem_Server_Set(fsid)

[0691] Since a filesystem's state is represented by its checkpoint value and modified InodeLists, and the time to recover a filesystem with the same checkpoint value is a function of the modifications represented by the modified InodeList, it is desirable to checkpoint the filesystem regularly. The NCM will coordinate this. A new checkpoint value will then be associated with the copies served by the current filesystem server set, and the modified InodeList on each member of the set will be cleared.

4.4 Get_Status_Of_Filesystem_Server_Set(fsid, &status_struct)

Return the current state of the filesystem server set:

    struct server_set_status {
        long configured_set;
        long current_set;
        long current_set_checkpoint_value;
        long joining_set;
        int  active_flag;
    };

[0692] 5. Description of Operations that Change the State of the CurrentServer Set

[0693] 5.1 Activate_Server_Set(fsid)

[0694] Allow NFS client requests for this fsid to reach the NFS servers on the members of the current filesystem server set.

[0695] 5.2 Pause_Filesystem_Server_Set(fsid)

[0696] Queue NFS client requests for this fsid headed for the NFS servers on the members of the current filesystem server set. Note that any queue space is finite, so pausing for too long can result in dropped messages. This operation waits until all pending NFS modification ops to this fsid have completed.

[0697] 5.3 Continue_Filesystem_Server_Set(fsid)

[0698] Queued NFS client requests for this fsid are allowed to proceed to the NFS servers on members of the current filesystem server set.

[0699] 5.4 Deactivate_Server_Set(fsid)

[0700] Newly arriving NFS requests for this fsid are now dropped. This operation waits until all pending NFS modification ops to this fsid have completed.

[0701] 6. Recovery Operations on a Filesystem Copy

[0702] There are two cases of Filesystem Copy:

[0703] 6.1 Construction: refers to the initialization of a "filesystem copy", which will typically entail copying every block from the Source to the Target. Construction occurs when the Filesystem Synchronization Number does not match between two filesystem copies.

[0704] 6.2 Restoration: refers to the recovery of a "filesystem copy".

[0705] Restoration occurs when the Filesystem Synchronization Number matches between two filesystem copies.

[0706] Conceptually, the two cases are very similar to one another. There are three phases of each Copy:

[0707] I. First-pass: copy-method everything that has changed since the last Synchronization. For the Construction case, this really is EVERYthing; for the Restoration case, this is only the inodes in the IN-Mod list.

[0708] II. Copy-method the IN-Copy list changes, i.e. modifications which occurred while the first phase was being done.

[0709] Repeat until the IN-Copy list is (mostly) empty; even if it is not empty, it is possible to proceed to synchronization at the cost of a longer synchronization time.

[0710] III. Synchronization by NCM: update of the Synchronization Number and clearing of the IN-Mod list. Note that by pausing ongoing operations at each NAS (and IXP if a new NAS is being brought into the peer group), it is possible to achieve synchronization on-line (i.e. during active NFS modify operations).

[0711] The copy-method refers to the actual method of copying used in either the Construction or Restoration cases. It is proposed here that the copy-method will actually hide the differences between the two cases.

[0712] 6.3 NAS-FS-Copy

[0713] An NAS-FS copy inherently utilizes the concept of "inodes" to perform the Copy. This is built into both the IN-Mod and IN-Copy lists maintained on each NAS.

[0714] 6.3.1 Construction of Complete Copy

[0715] Use basic volume block-level mirroring to make a "first pass" copy of the entire volume, from Source to Target NAS. This is an optimization to take advantage of sequential I/O performance; however, this will impact the copy-method. The copy-method will be an 'image' copy, i.e. it is a volume block-by-block copy; conceptually, the result of the Construction will be a mirror-volume copy. (Actually, the selection of volume block-level copying can be determined by the amount of "used" filesystem space; i.e. if the filesystem were mostly empty, it would be better to use an inode logical copy as in the Restoration case.)

[0716] For this to work correctly, since a physical copy is being done, the completion of the Copy (i.e. utilizing the IN-Copy) must also be done at the physical-copy level; stated another way, the "inode" copy-method must be done at the physical-copy level to complete the Copy.

[0717] 6.4 Copy-method

[0718] The inode copy-method must exactly preserve the inode: this is not just the inode itself, but also includes the block mappings. For example, copying the 128 bytes of the inode will only capture the Direct, 2nd-level, and 3rd-level indirect block pointers; it will not capture the data in the Direct blocks, nor the levels of indirection embedded in both the 2nd/3rd indirect blocks. In effect, the indirect blocks of an inode (if they exist) must be traversed and copied exactly; another way to state this: the list of all block numbers allocated to an inode must be copied.

[0719] 6.5 Special Inodes:

[0720] Special inodes will be instantiated in both IN-Mod and IN-Copy which reflect changes to filesystem metadata: specifically block-allocation and inode-allocation bitmaps (or alternatively, for each UFS cylinder-group), and superblocks. This is because all physical changes (i.e. this is a physical-image copy) must be captured in this copy-method.

[0721] 6.6 Locking:

[0722] Generally, any missed or overlapping updates will be caught by repeating IN-Copy changes; any racing allocations and/or de-allocations will be reflected in both the inode (being extended or truncated) and the corresponding block-allocation bitmap(s). Note these special inodes are not used for Sparse Filesystem Copies.

[0723] However, while the block map is being traversed (i.e. 2nd/3rd indirect blocks), changes during the traversal must be blocked to prevent inconsistencies. Since the copy-method can be repeated, it would be best to utilize the concept of a soft-lock, which would allow an ongoing copy-method to be aborted by the owning/Source-NAS if there was a racing extension/truncation of the file.

[0724] 6.7 Restoration of Complete Copy

[0725] This step assumes that two NAS' differ only in the IN-Mod list; to complete re-Synchronization, it requires that all changed inodes be propagated from the Source NAS to the Target NAS (since the last synchronization-point).

[0726] 6.8 Copy-method

[0727] The inode copy-method occurs at the logical level: specifically, the copying is performed via logical reads of the inode, and no information is needed about the actual block mappings (other than to maintain sparse inodes). Recall that the Construction case required a physical-block copy of the inode block-maps (i.e. block-map tree traversal), creating a physical-block mirror-copy of the inode.

[0728] 6.9 Special Inodes

[0729] No special inodes are needed, because per-filesystem metadata is not propagated for a logical copy.

[0730] 6.10 Locking

[0731] Similarly (to the Construction case), a soft-lock around an inode is all that is needed.

[0732] 6.11 Data structures

[0733] There are two primary lists: the IN-Mod and the IN-Copy list. The IN-Copy is logically nested within the IN-Mod.

[0734] 6.11.1 Modified-Inodes-list (IN-Mod)

[0735] The IN-Mod is the list of all modified inodes since the last Filesystem Checkpoint:

[0736] Worst-case, if an empty filesystem was restored from backup, the list would encompass every allocated inode.

[0737] Best-case, an unmodified filesystem will have an empty list; a filesystem with a small working-set of inodes being modified will have a (very) small list.

[0738] The IN-Mod is used as a recovery tool, which allows the owning NAS to be used as the 'source' for a NAS-FS-Copy. It allows the NCM to determine which inodes have been modified since the last Filesystem Checkpoint.

[0739] The IN-Mod is implemented in non-volatile storage, primarily for the case of chassis crashes (i.e. all NAS' crash), as one IN-Mod must exist to recover. Conceptually, the IN-Mod can be implemented as a Bitmap or as a List.

[0740] The IN-Mod tracks any modifications to any inode by a given NAS. This could track any change to the inode 'object' (i.e. both inode attributes and inode data), or differentiate between the inode attributes and the data contents.

[0741] The IN-Mod must be updated (for a given inode) before the inode is committed to non-volatile storage (i.e. disk, or NVRAM); otherwise, there is a window where the system could crash and the change not be reflected in the IN-Mod. In a BSD implementation, the call to add a modified inode to the IN-Mod could be done in VOP_UPDATE.

[0742] Finally, the Initialization case requires 'special' inodes to reflect non-inode disk changes, specifically filesystem metadata, e.g. cylinder-groups and superblocks. Since Initialization is proposing to use a block-level copy, all block-level changes need to be accounted for by the IN-Mod.
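
A bitmap realization of the IN-Mod (one of the two representations [0739] allows) can be sketched as follows. The inode count, and the use of ordinary memory in place of NVRAM, are assumptions for illustration.

    #include <stdint.h>

    #define MAX_INODES 65536   /* assumed per-filesystem inode count */

    static uint32_t in_mod[MAX_INODES / 32];  /* would live in NVRAM */

    /* Must be called before the inode is committed to stable storage
     * (e.g., from VOP_UPDATE in a BSD implementation, per [0741]). */
    void in_mod_mark(uint32_t ino)  { in_mod[ino / 32] |= 1u << (ino % 32); }
    int  in_mod_test(uint32_t ino)  { return (in_mod[ino / 32] >> (ino % 32)) & 1; }

    /* Cleared at each Filesystem Checkpoint. */
    void in_mod_clear_all(void)
    {
        for (uint32_t i = 0; i < MAX_INODES / 32; i++)
            in_mod[i] = 0;
    }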

[0743] 6.11.2 Copy-Inodes-list (IN-Copy)

[0744] The IN-Copy tracks any modifications to an inode by a given NAS once a Copy is in progress: it allows a Source-NAS to determine which inodes still need to be copied because they have changed during the Copy. In other words, it is an interim modified-list which exists during a Copy. Once the Copying begins, all changes made to the IN-Mod are mirrored in the IN-Copy; this effectively captures all changes made while the Copy is in progress.

[0745] 6.11.3 Copy Progress:

[0746] The Source NAS needs to know which inodes to copy to the Target NAS. Conceptually, this is a snapshot 'image' of the IN-Mod before the IN-Copy is enabled, as this lists all the inodes which need to be copied at the beginning of the Copy (where the IN-Copy captures all changes rolling forward). In practice, the IN-Mod itself can be used, at the minor cost of repeating some of the Copy when the IN-Copy is processed.

[0747] Note the IN-Copy need not be implemented in NVRAM, since any NAS crashes (either Source or Target) can be restarted from the beginning. If an IN-Copy is instantiated, the calls to update the IN-Copy can be hidden in the IN-Mod layer.

[0748] 6.11.4 Copying Inodes:

[0749] An on-disk inode is 128 bytes (i.e. this is effectively the inode's attributes); the inode's data is variable length, and can vary between 0 and 4 GB, in filesystem fragment-size increments. On-disk inodes tend to be allocated in physically contiguous disk blocks, hence an optimization is to copy a large number of inodes all at once. CrosStor note: all inodes are stored in a reserved inode (file) itself.

[0750] 6.11.5 Construction Case

[0751] In this case, locking is necessary to prevent racing changes to the inode (and/or data contents), as the physical image of the inode (and data) needs to be preserved.

[0752] Specifically, the block mapping (direct and indirect blocks) needs to be preserved exactly in the inode; so both the block-mapping and every corresponding block in the file have to be written to the same physical block together.

[0753] As an example, assume the race is where a given file is first being truncated, and then extended. Since each allocated block needs to be copied exactly (i.e. to the same physical block number on the volume), care has to be taken that the copy does not involve a block in transition. Otherwise, locking on block allocations would have to occur on the source-NAS. Instead, locking on an inode would seem the better alternative here. An optimization would be to allow a source-NAS to 'break' a Copy-Lock, with the realization that an inode being Copied should defer to a waiting modification.

[0754] 6.11.6 Restoration Case

[0755] In this case, no locking is implied during an inode-copy, since any "racing" modifications will be captured by the IN-Copy. A simple optimization might be to abort an in-progress Copy if such a 'race' is detected; e.g., imagine a very large file Copy which is being modified.

[0756] Specifically, the inode is copied, but not the block-mapping; the file data (represented by the block-mapping) is logically copied to the target NAS.

Examples—Set 1

[0757] 1. Walkthroughs of Operations on a Current Filesystem Server Set

Create_Current_Server_Set(fsid, slots/cpus)

Assumptions

[0758] Assume that no NAS server is serving the filesystem; the current filesystem server set is empty.

[0759] Steps

[0760] NAS A boots and tells the NCM it is up.

[0761] The NCM determines the new server's role in serving and that the filesystem is not being served by any NAS servers.

[0762] The NCM asks server A for the checkpoint value for the filesystem and also its modified InodeList.

[0763] The NCM ensures that this is the most up-to-date copy of the filesystem. (It reconciles static configuration info on the filesystem with which servers are actually running, looks in NVRAM if needed . . . )

[0764] NCM activates the server set.

[0765] The filesystem is now being served.

[0766] Add_Member_to_Current_Filesystem_Server_Set(fsid)

Assumptions

[0767] Assume a complete copy of the filesystem is already being served.

[0768] The current filesystem server set contains NAS B.

[0769] The current filesystem server set is active.

[0770] NAS A is down.

[0771] NAS A boots and tells the NCM it is up.

[0772] Steps

[0773] The NCM determines the new server's role in serving the filesystem and determines that the current server set for this filesystem contains only NAS B.

[0774] The NCM asks server A for the checkpoint value for the filesystem and also its modified InodeList.

[0775] NCM initiates recovery and asks NAS A to do it.

[0776] NAS A finishes recovery and tells the NCM.

[0777] The NCM pauses the current filesystem server set.

[0778] NCM asks NAS A to do recovery to catch anything that might have changed since the last recovery request. This should only include NFS requests received since the last recovery.

[0779] NAS A completes the recovery.

[0780] The NCM asks all members of the set to update their filesystem checkpoint value. They all respond.

[0781] The NCM resumes the current filesystem server set.

[0782] A new filesystem checkpoint has been reached.

Checkpointing an Active Filesystem Server Set

[0783] Assumptions

[0784] Steps

[0785] NCM determines it is time to bring all the members of the current server set to a checkpoint.

[0786] NCM asks the NCM agent on one member of the server set to forward a multicast filesystem sync message to all members of the current server set. This message contains a new checkpoint value for the filesystem.

[0787] Upon receipt of this message the NAS server must finish processing any NFS requests received prior to the sync message that apply to the filesystem. New requests must be deferred.

[0788] The NAS server then writes the new checkpoint value to stable storage, clears any modified InodeLists for the filesystem, and updates the NFS modification sequence number.

[0789] The NAS server then sends a message to the NCM indicating that it has reached a new filesystem checkpoint.

[0790] The NCM waits for these messages from all NAS servers.

[0791] The NCM then sends a multicast to the current server set telling the members to start processing NFS requests.

[0792] The NCM then updates its state to indicate a new filesystem checkpoint has been reached.

Examples—Set 2

[0793] 2. UML Static Structure Diagram

[0794] FIG. 37 is a representation of the NCM, IXP and NAS server classes.

[0795] For each, the top box is the name, the second box contains attributes of an instance of this class, and the bottom box describes the methods each class must implement.

[0796] Attributes Description

[0797] Data local to an instance of the class that make it unique.

[0798] Methods Description

[0799] Those preceded with a + are public and usually invoked by receiving a message. The method is preceded by the name of the sender of the message surrounded by <<>>. Calling out the sender in the description should help to correlate the messaging scenarios described in this document with implemented methods in the classes. Those preceded by a - are private methods that may be invoked during processing of public methods. They help to organize and reuse functions performed by the class.

VIII. System Mediation Manager

[0800] The following discussion sets forth the functional specification and design for the Mediation Manager subsystem of the Pirus box.

[0801] Mediation refers to storage protocol mediation, i.e., mediating between two transport protocols (e.g., FC and IP) that carry a storage protocol (SCSI). The system disclosed herein will use the mediation configurations shown in FIGS. 38A, B and C. Thus, for example, in FIG. 38A, the Pirus box terminates a mediation session. In FIGS. 38B and C, Pirus Box1 originates a mediation session and Pirus Box2 terminates it. In FIG. 38C, Pirus Box1 runs backup software to copy its disks to the other Pirus box.

[0802] 1. Components

[0803] In accordance with one embodiment of the invention, mediation is handled by a Mediation Manager and one or more Mediation Protocol Engines. Their interaction with each other and with other parts of the Pirus box is shown in FIG. 39.

[0804] 2. Storage Hierarchy

[0805] In accordance with known storage practice, at the lowest level of storage there are physical disks, and each disk has one or more LUNs. In the system of the invention, as shown in FIG. 40, the Volume Manager configures the disks in a known manner (such as mirroring, RAID, or the like) and presents them to the SCSI server as volumes (e.g., Vol 1 through Vol 5). The SCSI server assigns each volume to a Virtual LUN (VL0 through VL2) in a Virtual Target (VT0 through VT1).

[0806] The following behaviors are observed:

[0807] 1. Each Volume corresponds to only one Virtual LUN.

[0808] 2. Each Virtual Target can have one or more Virtual LUNs.

[0809] 3. Each Virtual Target is assigned an IP address.

[0810] 4. A virtual target number is unique in a Pirus box.

[0811] 3. Functional Specification

[0812] In one practice of the invention, the Mediation Manager will be responsible for configuration, monitoring, and management of the Mediation Protocol Engines; only one instance of the Mediation Manager will run on each 755 on the SRC. Each Mediation Manager will communicate with the MIC and the Mediation Protocol Engines as shown in FIG. 39 above. The MIC provides the configurations and commands, and the Mediation Protocol Engines actually implement the various mediation protocols, such as iSCSI, SEP, and the like. The Mediation Manager will not be involved in the actual mediation; hence, it will not be in the data path.

[0813] 4. Functional Requirements

[0814] 1. In one practice of the invention, the Mediation Manager always listens to receive configuration and command information from the MIC, and sends statistics back to the MIC.

[0815] 2. The Mediation Manager accepts the following configuration information from the MIC, and configures the Mediation Protocol Engines appropriately:

[0817] a. Add a virtual target

[0818] i. Mediation Protocol

[0819] 1. TCP/UDP port number

[0820] 2. Max inactivity time

[0821] ii. Virtual target number

[0822] iii. IP address

[0823] iv. Number of LUNs

[0824] v. Max number of sessions

[0825] b. Modify a virtual target

[0826] c. Remove a virtual target

[0827] 3. Once configured by the MIC, the Mediation Manager spawns only one Mediation Protocol Engine for each configured mediation protocol. A Mediation Protocol Engine will handle all the sessions for that protocol to any/all the accessible disks on its Fibre Channel port.

[0828] 4. The Mediation Manager accepts the following commands from the MIC and sends a corresponding command to the appropriate Mediation Protocol Engine:

[0830] a. Start/Stop a Mediation Protocol Engine

[0831] b. Abort a session

[0832] c. Get/Reset a stat for a mediation protocol and virtual target

[0833] 5. The Mediation Manager will collect statistics from the Mediation Protocol Engines and report them to the MIC. The stats are:

[0834] a. Number of currently established sessions per mediation protocol per virtual target; this stat is unaffected by a stat reset.

[0835] b. A list of all the sessions for a mediation protocol and virtual target: virtual LUN, attached server, idle time; this stat is unaffected by a stat reset.

[0836] c. Number of closed sessions due to "inactivity" per mediation protocol per virtual target.

[0837] d. Number of denied sessions due to "max # of sessions reached" per mediation protocol per virtual target.

[0838] 6. The Mediation Manager will communicate the rules passed down by the MIC to the appropriate Mediation Protocol Engine:

[0839] a. Host Access Control per mediation protocol (in one practice of the invention, this will be executed on the LIC)

[0840] i. Deny sessions from a list of hosts/networks

[0841] ii. Accept sessions only from a list of hosts/networks

[0842] b. Storage Access Control per virtual target

[0843] i. Age out a virtual target, i.e., deny all new sessions to a virtual target. This can be used to take a virtual target offline once all current sessions die down.

[0844] 7. The Mediation Manager (as ordered by the user through the MIC) will send the following commands to the Mediation Protocol Engines:

[0845] a. Start (this may be equivalent to spawning a new engine)

[0846] b. Stop

[0847] c. Abort a session

[0848] d. Get/Reset stats for a mediation protocol and virtual target.

[0849] 8. The Mediation Manager will register to receive ping (ICMP Echo Request) packets destined for any of its virtual targets.

[0850] 9. Once the Mediation Manager receives a ping (ICMP Echo Request) packet for a virtual target, it will send a request to the "Storage Health Service" for a status check on the specified virtual target. Once the reply comes back from the Storage Health Service, the Mediation Manager will send back an ICMP Echo Reply packet.

[0851] 10. The Mediation Manager will register to send/receive messages through IPC with the Storage Health Service.

[0852] 5. Design

[0853] In the embodiment shown, only one Mediation Manager task runs on each 755 on the SRC. It listens for configuration and command information from the MIC to manage the Mediation Protocol Engines. It also reports back statistics to the MIC. The Mediation Manager spawns the Mediation Protocol Engines as tasks when necessary. In addition, it also handles ping (ICMP Echo Request) packets destined to any of its virtual targets.

[0854] 6. Data Structures

[0855] In this embodiment, the data structures for keeping track of virtual target devices and their corresponding sessions are set up as shown in FIG. 41. In the embodiment shown in FIG. 41, the number of supported virtual target devices on a Pirus box is 1024, each having 256 sessions; and the virtual target devices are different for termination and origination.

[0856] At startup, the Mediation Manager sets up an array of MED_TYPE_CFG_T, one for each mediation protocol type: iSCSI, SEP, SCSI over UDP, and FC over IP. It will then allocate an array of pointers for each virtual target device, DEV_ENTRY_T. Once the MIC configures a new virtual target device (for termination or origination), the Mediation Manager allocates and links in a MED_DEV_CFG_T structure. Finally, when a new session is established, a MED_SESS_ENTRY_T structure is allocated.

[0857] This structure will provide a reasonable compromise between memory consumption and the speed at which the structure can be searched for a device or session.

[0858] In this practice of the invention, a session id is a 32-bit entity defined as follows to allow for direct indexing into the above structure.

[0859] Mediation type is 4 bits, which allows for 16 mediation protocol types.

[0860] The next single bit indicates whether the session is for termination or origination.

[0861] The next 11 bits represent the device number, basically an index into the device array.

[0862] The 8 bits of session number is the index into the session array.

[0863] Finally, 8 bits of generation number are used to distinguish old sessions from current sessions.
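
Under that layout (4 + 1 + 11 + 8 + 8 = 32 bits), packing and unpacking reduce to shifts and masks, as sketched below. Placing the mediation type in the most significant bits is an assumption; the text fixes the field order and widths but not the absolute bit positions.

    #include <stdint.h>

    /* 4 bits type | 1 bit origination | 11 bits device |
     * 8 bits session | 8 bits generation */
    static inline uint32_t sess_id_pack(uint32_t med_type, uint32_t is_orig,
                                        uint32_t dev, uint32_t sess, uint32_t gen)
    {
        return ((med_type & 0xF)   << 28) |
               ((is_orig  & 0x1)   << 27) |
               ((dev      & 0x7FF) << 16) |
               ((sess     & 0xFF)  <<  8) |
                (gen      & 0xFF);
    }

    /* Direct indexing into the device and session arrays. */
    static inline uint32_t sess_id_dev(uint32_t id)  { return (id >> 16) & 0x7FF; }
    static inline uint32_t sess_id_sess(uint32_t id) { return (id >>  8) & 0xFF;  }
    static inline uint32_t sess_id_gen(uint32_t id)  { return  id        & 0xFF;  }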

[0864] 7. Flow Chart

[0865] In this practice of the invention, there will be one semaphore that the Mediation Manager will wait upon. Two events will post the semaphore to awaken the Mediation Manager:

[0866] 1. Arrival of a packet through IPCEP from the MIC

[0867] 2. Arrival of a ping packet

[0868] As indicated in FIG. 42, the Mediation Manager flow includes the following steps (sketched in code after the list):

[0869] Initializing all data structures for mediation 4201;

[0870] Creating two queues, one for ping packets and one for IPCEP messages 4202;

[0871] Registering to receive IPCEP messages from the MIC 4203;

[0872] Registering to receive ping packets from the TCP/IP stack 4204;

[0873] Waiting to receive ping packets from the TCP/IP stack;

[0874] Waiting to receive a ping or IPCEP message;

[0875] Checking whether the received item is an IPCEP message, and ifso,

[0876] Retrieving the message from the queue, checking the message type, calling the med_engine API (or similar process), and then returning to the "wait to receive" step; or, if not,

[0877] Checking whether it is a ping packet, and if so, retrieving the message from the queue, processing the ping packet, contacting the storage health service, and returning to the "wait to receive" step; or

[0878] if not a ping packet, returning to the “wait to receive” step.
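
The loop in FIG. 42 can be rendered roughly as follows. This is a hedged sketch: all helper functions are hypothetical stand-ins for the semaphore, queue, and engine services named in the text.

    #include <stdbool.h>

    extern void semaphore_wait(void);            /* posted by either event    */
    extern bool ipcep_queue_pop(void *msg);      /* MIC command/config queue  */
    extern bool ping_queue_pop(void *pkt);       /* ping packet queue         */
    extern void med_engine_dispatch(void *msg);  /* "call the med_engine API" */
    extern void process_ping(void *pkt);         /* consult Storage Health    */

    void mediation_manager_loop(void)
    {
        char msg[256], pkt[256];   /* placeholder message buffers */
        for (;;) {
            semaphore_wait();
            if (ipcep_queue_pop(msg)) {
                med_engine_dispatch(msg);        /* IPCEP message from MIC    */
            } else if (ping_queue_pop(pkt)) {
                process_ping(pkt);               /* ping for a virtual target */
            }
            /* otherwise: spurious wakeup; return to waiting */
        }
    }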

IX. Mediation Caching

[0879] The following section describes techniques for utilizing data caching to improve access times and decrease latency in a mediation system according to the present invention. By installing a data cache on the Client Server (as illustrated in FIG. 43), the local clients can achieve faster access times for the data being served by the Data Server. The cache will provide access to data that has already (recently) been read from the Data Server. In the case where a client attempts to access a segment of data that has been previously read, either by the same client or any other attached client, the data can be delivered from the local cache. If the requested data is not in the local cache, the read operation must be transmitted to the Data Server, and the server will access the storage system. Once the data is transferred back to the Client Server, the data will be stored in the local cache, and be available for other clients to access.

[0880] In a similar fashion, the write performance of the clients can be improved by employing Non-Volatile RAM (NVRAM) on the client server. Using the NVRAM, the system can reply to the local clients that the write operation is complete as soon as the data is committed to the NVRAM cache. This is possible since the data will be preserved in the NVRAM, and will eventually be written back to the Data Server for commitment to the storage device by the system. The performance can be further improved by altering the way in which the NVRAM data cache is manipulated before the data is sent to the Data Server. The write data from the NVRAM can be accumulated such that a large semi-contiguous write access can be performed to the data server rather than small piecewise accesses. This improves both the data transmit characteristics between the servers and the storage characteristics of the Data Server, since a large transfer involves less processor intervention than small transfers.

[0881] This system improves latency on data writes when there is space available in the write cache, because the client writer does not have to wait for the write data to be transmitted to the Data Server and be committed to the storage device before the acknowledgement is generated. The implied guarantee of commitment to the storage device is managed by the system through the utilization of NVRAM and a mechanism to deliver the data to the Data Server after a system fault.

[0882] The system improves latency on data reads when the read data segment is available in the local read cache, because the client does not have to wait for the data transmission from the data server or the storage access times before the data is delivered. In the case where the data is not in the local cache, the system performance is no worse than that of a standard system.

[0883] The system requires that the data in the write cache be available to the client readers so that data integrity can be maintained. The order of operation for read access is as follows (a code sketch of both access paths appears after the write-access list below):

[0884] 1) check the local write cache for data segment match

[0885] 2) (if not found in 1) check the local read cache for datasegment match

[0886] 3) (if not found in 2) issue the read command to the Data Server

[0887] 4) Once the data is transmitted from the Data Server save it inthe local data cache.

[0888] The order of operation for write access is

[0889] 1) check the local read cache for a matching data segment andinvalidate the matching read segments

[0890] 2) check the local write cache for matching write segments andinvalidate (or re-use)

[0891] 3) generate a new write cache entry representing the write datasegments.
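The two sequences above can be sketched as follows. This is an illustrative Python fragment, not the Pirus code; it assumes a hypothetical fetch_from_data_server helper and keys segments by a simple identifier rather than handling overlapping byte ranges.

    def fetch_from_data_server(seg):          # hypothetical stand-in
        return b"data-for-" + bytes(str(seg), "ascii")

    class ClientServerCache:
        """Simple single-Client-Server cache (sketch)."""

        def __init__(self):
            self.read_cache = {}              # segment -> data
            self.write_cache = {}             # stands in for the NVRAM write cache

        def read(self, seg):
            if seg in self.write_cache:       # 1) check local write cache
                return self.write_cache[seg]
            if seg in self.read_cache:        # 2) check local read cache
                return self.read_cache[seg]
            data = fetch_from_data_server(seg)  # 3) issue the read to the Data Server
            self.read_cache[seg] = data       # 4) save in the local data cache
            return data

        def write(self, seg, data):
            self.read_cache.pop(seg, None)    # 1) invalidate matching read segment
            self.write_cache.pop(seg, None)   # 2) invalidate (or re-use) write segment
            self.write_cache[seg] = data      # 3) new write cache entry
            # the client can be acknowledged here, once the data is in NVRAM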

[0892] FIG. 43 shows the simple system with one Client Server per Data Server. Note that the Client Server can have any number of clients, and a Client Server can target any number of Data Servers.

[0893] The caching mechanism becomes more complex in a system such as the one shown in FIG. 44. When a system contains more than one Client Server per Data Server, the cache coherency mechanism must become more elaborate. This is because one Client Server can modify data that is in the local cache of the other Client Server, so that the data will no longer match between the Client Servers.

[0894] Cache coherency can be maintained in the more complex system by determining the state of the cache on the Data Server. Before any data can be served from the Client Server local data cache, a message must be sent to the Data Server to determine if the data in the local data cache must be updated from the Data Server. One method of determining this is by employing time-stamps to determine if the data in the Client Server local data cache is older than that on the Data Server. If the cache on the Client Server needs to be updated before the data is served to the client, a transmission of the data segment from the Data Server will occur. In this case, the access from the client will look like a standard read operation, as if the data were not in the local cache. The local data cache will be updated by the transmission from the Data Server, and the time-stamps will be updated.

[0895] Similarly, in the data write case, the Data Server must be consulted to see if the write data segments are locked by another client. If the segments are being written by another Client Server during the time a new Client Server wants to write the same segments (or overlapping segments), the new write must wait for the segments to be free (write operation complete from the first Client Server). A lightweight messaging system can be utilized to check and maintain the cache coherency by determining the access state of the data segments on the Data Server.

[0896] The order of operation for read access in the complex system is as follows:

[0897] 1) check the local write cache for a data segment match

[0898] 2) (if not found in 1) check the local read cache for a data segment match

[0899] 3) (if the segment is found in the local cache) send a request to the Data Server to determine the validity of the local read cache

[0900] 4) if the local read cache is not valid, or the segment is not found in the local cache, issue a read operation to the Data Server.

[0901] 5) once the data is transmitted from the Data Server, save it in the local data cache.

[0902] Note that the case where the local cache data is not valid can be optimized by having the Data Server return the read data along with the indication that the local cache data is invalid. This saves an additional request round-trip.

[0903] The order of operation for write access in the complex system is as follows (a sketch of these coherency checks appears after the list):

[0904] 1) check the local read cache for a matching data segment and invalidate the matching read segments

[0905] 2) check the local write cache for matching write segments and invalidate (or re-use)

[0906] 3) send a message to the Data Server to determine if the write segment is available for writing (if the segment is not available, wait for the segment to become available)

[0907] 4) generate a new write cache entry representing the write data segments.

[0908] 5) send a message to the Data Server to unlock the data segments.

[0909] Note that in step 3, the message will generate a lock on the data segment if the segment is available; this saves an additional request round-trip.
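The coherency protocol of paragraphs [0894]-[0908] can be sketched as below. This Python fragment is illustrative only; DataServerStub is a hypothetical stand-in for the Data Server side of the exchange, and a real implementation would use the lightweight messaging system described above rather than in-process calls.

    import time

    class DataServerStub:
        """Hypothetical Data Server: tracks per-segment time-stamps and locks."""
        def __init__(self):
            self.stamp = {}                  # segment -> last-modified time-stamp
            self.locks = set()               # segments currently being written

        def is_valid(self, seg, cached_stamp):
            return self.stamp.get(seg, 0.0) <= cached_stamp

        def try_lock(self, seg):
            if seg in self.locks:
                return False                 # caller must wait for the segment to free
            self.locks.add(seg)
            return True

        def unlock(self, seg):
            self.locks.discard(seg)

    def coherent_read(server, read_cache, seg, fetch):
        entry = read_cache.get(seg)
        if entry is not None:
            data, stamp = entry
            if server.is_valid(seg, stamp):  # validity-request round-trip
                return data
        data = fetch(seg)                    # standard read from the Data Server
        read_cache[seg] = (data, time.time())  # update the cache and time-stamp
        return data

    def coherent_write(server, read_cache, write_cache, seg, data):
        read_cache.pop(seg, None)            # invalidate matching read segment
        write_cache.pop(seg, None)           # invalidate (or re-use) write segment
        while not server.try_lock(seg):      # wait for the segment to become free
            time.sleep(0.001)
        write_cache[seg] = data              # new write cache entry
        server.unlock(seg)                   # unlock message after the write completes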

X. Server Health Monitoring

[0910] The following discussion describes the Pirus Server Health Manager, a system process that runs within the Pirus chassis and monitors the state of storage services that are available to external clients. Server Health manages the state of Pirus storage services, and uses that data to regulate the flow of data into the Pirus chassis. Server Health will use this information to facilitate load balancing, fast-forwarding or discarding of traffic coming into the system.

[0911] The Pirus Server Health Manager (SHM) is responsible for monitoring the status or health of a target device within the Pirus chassis. Pirus target devices can include, for example, NAS and mediation/iSCSI services that run on processors connected to storage devices.

[0912] In one practice of the invention, the SHM runs on the Pirus system processor (referred to herein as the Network Engine Card or NEC) where NAS or iSCSI storage requests first enter the system. These requests are forwarded from this high-speed data path across a switched fabric to target devices. SHM will communicate with software components in the system and provide updated status to the data-forwarding path.

[0913] 1. Operation with Network Attached Storage (NAS):

[0914] In accordance with the invention, SHM communicates with components on the NAS Storage Resource Card (SRC) to monitor the health of NFS services. NFS requests are originated from the NEC and inserted into the data stream along with customer traffic that enters from the high-speed data path. Statistics are gathered to keep track of latency, timeouts and any errors that may be returned from the server.

[0915] SHM also exchanges IPC messages with the NFS Coherency Manager (NCM) on the SRC to pass state information between the two processors. Message sequences exchanged between these two systems can originate from the NAS or from the NEC.

[0916] 2. Operation with iSCSI/Mediation Devices:

[0917] SHM will also communicate with a Mediation Device Manager (MDM) that runs on an SRC card and manages mediation devices like iSCSI. SHM will send ICMP messages to each target and wait on responses. Statistics are also gathered for mediation devices to keep track of latency, timeouts and error codes. IPC messages will also be sent from the NEC to MDM whenever an ICMP request times out.
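By way of illustration, per-target health statistics of the kind gathered by SHM might be kept as sketched below. The class and method names are hypothetical, and the IPC notification to MDM is reduced to a comment.

    import time

    class TargetHealth:
        """Sketch of per-target latency/timeout/error statistics."""
        def __init__(self, name):
            self.name = name
            self.latencies = []              # seconds, per ICMP or NFS probe
            self.timeouts = 0
            self.errors = 0

        def record_reply(self, sent_at, error_code=0):
            self.latencies.append(time.time() - sent_at)
            if error_code:
                self.errors += 1

        def record_timeout(self):
            self.timeouts += 1
            # in the chassis, an IPC message would be sent from the NEC to MDM here

        def healthy(self, max_timeouts=3):
            return self.timeouts < max_timeouts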

[0918] Interaction with Data Forwarding Services: Data arrives into the Pirus chassis from high-speed network interfaces like Ethernet. Low-level drivers and the Intelligent Filtering and Forwarding (IFF) component, described elsewhere in this document, receive this data. IFF works with code in the IXP1200 micro-engines to forward traffic across the backplane to the NAS or iSCSI service.

[0919] 3. Forwarding of NFS Traffic:

[0920] Either a single server or multiple servers within the Pirus chassis can consume NFS traffic. It is contemplated that NFS traffic forwarded to a single server will always be sent to the same target CPU across the backplane as long as that CPU and server are alive and healthy.

[0921] A group of NFS servers can provide the same 'virtual' service, where traffic can be forwarded to multiple servers that reside on multiple CPUs. In this configuration, NFS write and create operations are replicated to every member of the group, while read operations can be load balanced to a single member of the group. The forwarding decision is based on the configured policy along with the server health of each of the targets.

[0922] Load balancing decisions for read operations may be based on a virtual service (defined by a single virtual IP address) and could be as simple as round-robin, or, alternatively, use a configured weight to determine packet forwarding. Health of an individual target could drop one of these servers out of the list of candidates for forwarding or affect the weighting factor.

[0923] Load balancing may also be based on NFS file handles. This requires that server health, IFF and micro-engine code manage state on NFS file handles and use this state for load balancing within the virtual service. File handle load balancing will work with target server balancing to provide optimum use of services within the Pirus chassis.

[0924] 4. NFS Read Load Balancing Algorithms:

[0925] The following read load balancing algorithms can be employed:

[0926] Round robin to each server within a virtual service

[0927] Configured weight of each server within a virtual service

[0928] Fastest response time determines the weight of each server within a virtual service

[0929] New file handle round robin to a server within a virtual service; accesses to the same file handle are always directed to the same server

[0930] New file handle configured weight to a server within a virtual service; accesses to the same file handle are always directed to the same server

[0931] Heavily accessed file list split across multiple servers

[0932] Each of the algorithms above will be affected by server health status along with previous traffic loads that have been forwarded. Servers may drop out of the server set if there is congestion or failure on the processor or associated disk subsystem.
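Two of the policies listed above, configured weights and file-handle stickiness, are sketched together below. The fragment is illustrative Python under assumed data shapes (the server names and weights are hypothetical); it is not the IXP micro-engine code.

    import random

    class NfsReadBalancer:
        """Weighted read balancing with file-handle stickiness (sketch)."""

        def __init__(self, weights):
            self.weights = dict(weights)     # server -> configured weight
            self.by_handle = {}              # file handle -> pinned server

        def drop_server(self, server):
            """Congested or failed servers leave the candidate set."""
            self.weights.pop(server, None)

        def pick(self, file_handle):
            pinned = self.by_handle.get(file_handle)
            if pinned is not None and pinned in self.weights:
                return pinned                # same file handle, same server
            servers = list(self.weights)
            chosen = random.choices(servers,
                                    [self.weights[s] for s in servers])[0]
            self.by_handle[file_handle] = chosen   # pin for later accesses
            return chosen

    balancer = NfsReadBalancer({"srv-a": 3, "srv-b": 1})
    print(balancer.pick(b"fh-1"))            # weighted choice, then sticky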

XI. Fast-Path: Description of Illustrated Embodiments

[0933] The following description refers to examples of Fast-Path implemented in the Pirus Box and depicted in the attached FIGS. 45 and 46. As noted above, however, the Fast-Path methods are not limited to the Pirus Box, and can be implemented in substantially any TCP/UDP processing system, with different combinations of hardware and software, the selection of which is a matter of design choice. The salient aspect is that Fast-Path code is accelerated using distributed, synchronized, fast-path and slow-path processing, enabling TCP (and UDP) sessions to run faster and with higher reliability. The described methods simultaneously maintain TCP state information in both the fast-path and the slow-path, with control messages exchanged between fast-path and slow-path processing engines to maintain state synchronization and hand off control from one processing engine to another. These control messages can be optimized to require minimal processing in the slow-path engines while enabling efficient implementation in the fast-path hardware. In particular, the illustrated embodiments provide acceleration in accordance with the following principles:

[0934] 1. Packet processing in a conventional TCP/IP stack is complex and time consuming. However, most packets do not represent an exceptional case and can be handled with much simpler and faster processing. The illustrated embodiments (1) establish a parallel, fast-path TCP/IP stack that handles the majority of packets with minimal processing, (2) pass exceptions to the conventional (slow-path) stack for further processing and (3) maintain synchronization between fast and slow paths.

[0935] 2. As a matter of design choice, the illustrated embodiments employ IXP micro-engines to execute header verification, flow classification, and TCP/IP check-summing. The micro-engines can also be used for other types of TCP/IP processing. Processing is further accelerated by this use of multiple, high-speed processors for routine operations.

[0936] 3. The described system also enables full control over the Mediation applications described in other sections of this document. Limits can be placed on the behavior of such applications, further simplifying TCP/IP processing.

[0937] 1. Fast-Path Architecture

Referring to FIG. 45, the illustrated Fast-Path implementations in the Pirus Box include the following three units, the functions of which are described below:

[0938] 1. The Fast-Path module of the SRC card, which integrates the Fast-Path TCP/IP stack. This module creates and destroys Fast-Path sessions based on the TCP socket state, and executes TCP/UDP/IP processing for Fast-Path packets.

[0939] 2. Micro-engine code running on the IXPs. This element performs IP header verification, flow classification (by doing a four-tuple lookup in a flow forwarding table) and TCP/UDP check-summing.

[0940] 3. IFF control code running on the IXP ARM. This module creates/destroys forwarding entries in the flow forwarding table based on the IPC messages from the SRC.

[0941] 2. Fast-Path Functions

[0942] 2.1 LRC Processing:

[0943] Referring again to FIGS. 45 and 46, it will be seen that the illustrated embodiments of Fast-Path utilize both LRC and SRC processing. When VSEs (Virtual Storage Endpoints) are created, IP addresses are assigned to each, and these IP addresses are added to the IFF forwarding databases on all IXPs. For Mediation VSEs, forwarding table entries will be labeled as Mediation in the corresponding destination IPC service number. When the IXP Receive micro-engine receives a packet from its Ethernet interface, it executes a lookup in the IFF forwarding database. If a corresponding entry is found for that packet, and the associated destination service is Mediation, the packet is passed to the IXP Mediation micro-engine for Fast-Path processing. The IXP Mediation micro-engine first verifies the IP header for correctness (length, protocol, IP checksum, no IP options and the like), verifies the TCP/UDP checksum, and then executes a flow lookup. If a corresponding entry is found, the flow ID is inserted into the packet (overwriting the MAC address) and the packet is forwarded to the Fast-Path service on the destination SRC. If a corresponding entry is not found, the packet is forwarded to the IFF service on the destination SRC.
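The four-tuple classification step can be sketched as follows. This Python fragment is illustrative only; the packet is modeled as a dictionary, and verify_ip_header and verify_l4_checksum are hypothetical stand-ins for the micro-engine checks.

    from typing import NamedTuple

    class FlowKey(NamedTuple):
        ip_src: str
        ip_dst: str
        port_src: int
        port_dst: int

    class FlowEntry(NamedTuple):
        card: int
        processor: int
        flow_id: int

    def verify_ip_header(pkt):                # hypothetical stand-in
        return True

    def verify_l4_checksum(pkt):              # hypothetical stand-in
        return True

    def classify(flow_table, pkt):
        """Return the service the packet is forwarded to (sketch)."""
        if not verify_ip_header(pkt):
            return "drop"
        if not verify_l4_checksum(pkt):
            return "error service on SRC"
        key = FlowKey(pkt["ip_src"], pkt["ip_dst"], pkt["sport"], pkt["dport"])
        entry = flow_table.get(key)
        if entry is None:
            return "IFF service on destination SRC"    # slow path
        pkt["mac_dst"] = entry.flow_id        # flow ID overwrites the MAC address
        return "Fast-Path service, SRC slot %d" % entry.card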

[0944] 2.2 SRC Processing:

[0945] Referring again to FIGS. 45 and 46, when the Fast-Path service on the SRC receives packets from the IPC layer, the SRC extracts the Session ID from the packet and uses it to look up the socket and TCP control blocks. It then determines whether the packet can be processed by the Fast-Path: i.e., the packet is in sequence, no retransmission, no data queued in the socket's Send buffer, no unusual flags, no options other than timestamp, and the timestamp is correct. If any condition is not met, the packet is injected into the slow-path TCP input routine for full processing. Otherwise, TCP counters are updated, ACK-ed data (if any) is released, an ACK packet is generated (if necessary), and the packet is handed directly to the application.
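The eligibility test just described reduces to a predicate of roughly the following shape. The field names are hypothetical stand-ins for the real socket and TCP control block fields; this is a sketch, not the SRC code.

    def fast_path_eligible(pkt, tcb, sock):
        """True if the packet can bypass the slow-path TCP input routine."""
        return (pkt["seq"] == tcb["rcv_nxt"]          # in sequence
                and not pkt["retransmission"]
                and not sock["send_queue"]            # nothing queued in the Send buffer
                and pkt["flags"] == {"ACK"}           # no unusual flags
                and pkt["options"] <= {"timestamp"}   # no options other than timestamp
                and pkt.get("timestamp_ok", True))    # timestamp is correct

    def fast_path_receive(pkt, tcb, sock):
        if not fast_path_eligible(pkt, tcb, sock):
            slow_path_tcp_input(pkt)                  # full TCP processing
            return
        tcb["rcv_nxt"] += pkt["len"]                  # update TCP counters
        release_acked_data(tcb, pkt)                  # release ACK-ed data, if any
        maybe_send_ack(tcb)                           # generate an ACK if necessary
        deliver_to_application(pkt)                   # hand packet to the application

    # hypothetical stand-ins so the sketch is self-contained
    def slow_path_tcp_input(pkt): pass
    def release_acked_data(tcb, pkt): pass
    def maybe_send_ack(tcb): pass
    def deliver_to_application(pkt): print("delivered", pkt["len"], "bytes")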

[0946] 2.3 Session creation/termination:

[0947] In the illustrated embodiments, a Fast-Path session is established immediately after establishment of a standard TCP session (inside the Accept and Connect Calls), and destroyed just before the socket is closed (inside the Close Call). A socket's Send Call will be modified to attempt a Fast-Path Send from the user task's context all the way to the IPC. If Fast-Path fails, the job will fall back to the regular (slow-path) code path of the TCP Send Call, by sending a message to the TCP task. Conversely, the Fast-Path Receive routines, which can be executed from an interrupt or as a separate task, can forward received packets to the user task's message queue, just as conventional TCP Receive processing does. As a result, from the perspective of the user application, packets received by the Fast-Path system are indistinguishable from packets received via the slow-path.

[0948] Referring again to FIGS. 45 and 46, at an initial time (i.e., prior to Fast-Path session creation), there will be no entries in the flow forwarding table, and all packets will pass through the IFF/IP/TCP path on the SRC as described in the other sections of this document. When a TCP (or UDP) connection is established, the TCP socket's code will call Fast-Path code to create a Fast-Path session. When the Fast-Path session is created, all IXPs will be instructed to create a flow forwarding table entry for the session. This ensures that if the route changes and a different IXP begins to receive connection data, appropriate routing information will be available to the "new" IXP. (In IP architectures it is possible to have an asymmetric path, in which outgoing packets are sent to an IXP different from the one receiving the incoming packets. As a result, it would be insufficient to maintain a forwarding table only on the IXP that sends packets out.) Each time a Mediation forwarding table entry is added to the associated IXP's forwarding table, it will broadcast to all SRCs (or, in an alternative embodiment, uni-cast to the involved SRC) a request to re-post any existing Fast-Path sessions for the corresponding address. This step ensures that when a new IXP is added (or crashes and is then re-booted), the pre-existing Fast-Path state is restored. Subsequently, when the TCP (or UDP) connection is terminated, the TCP sockets code will call Fast-Path code to delete the previously-created Fast-Path session. All IXPs will then be instructed to destroy the corresponding flow forwarding table entry.

[0949] In the case that an SRC processor crashes or is removed from service, the MIC module will detect the crash or removal, and issue a command to remove the associated Mediation IP address. Similarly, if the SRC processor is restarted, it will issue a command to once again add the corresponding Mediation IP address. When the IFF module on the IXP removes the forwarding entry for the corresponding Mediation IP address, it will also remove all corresponding Fast-Path session forwarding entries.

[0950] 2.4 Session Control Blocks:

[0951] The described Fast-Path system maintains a table of Fast-Path Session Control blocks, each containing at least the following information:

[0952] 1. Socket SID and SUID, for Fast TCP and Socket Control blocks in Receive operations.

[0953] 2. TCP/IP/Ethernet or UDP/IP/Ethernet header templates for Send operations.

[0954] 3. Cached IP next-hop information, including outgoing source and destination MAC addresses, and the associated IXP's slot, processor and port numbers.

[0955] An index of the Session Control block serves as a Session ID, enabling rapid session lookups. When a Fast-Path Session is created, the Session ID is stored in the socket structure to enable quick session lookup during Sends.
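A Session Control block table of this kind might look as follows; the field and function names are illustrative assumptions, not the actual structures.

    from dataclasses import dataclass

    @dataclass
    class SessionControlBlock:
        socket_sid: int          # socket IDs for Receive-side lookups
        socket_suid: int
        header_template: bytes   # TCP/IP/Ethernet or UDP/IP/Ethernet template
        next_hop_mac: bytes      # cached IP next-hop information
        ixp_slot: int = 0
        ixp_processor: int = 0
        ixp_port: int = 0

    sessions = []                # the table index doubles as the Session ID

    def create_session(scb):
        sessions.append(scb)
        return len(sessions) - 1  # Session ID, stored in the socket structure

    def lookup_session(session_id):
        return sessions[session_id]   # constant-time session lookup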

[0956] 2.5 IXP Services:

[0957] Referring again to FIGS. 45 and 46, when a new Fast-Path session is established, the IXPs in the Pirus Box are set to forward the TCP or UDP flow to a well-known Fast-Path service on the destination SRC processor. The associated IXP will insert an associated Fast-Path flow ID into the first word of the packet's Ethernet header (thereby overriding the destination MAC address) to permit easy flow identification by the Fast-Path processing elements. The IXP will execute a lookup of a four-tuple value (consisting of ip_src, ip_dst, port_src, port_dst) in the forwarding table to determine the destination (card, processor, flow ID). In addition, the IXP will execute the following steps for packets that match the four-tuple lookup:

[0958] 1. Check the IP header for correctness. Drop the packet if this fails.

[0959] 2. Execute the IP checksum. Drop the packet if this fails.

[0960] 3. Confirm that there is no fragmentation and there are no IP options. (As a matter of design choice, certain TCP options are permitted, for timestamp and RDMA.) If this fails, forward the packet to the SRC "slow path" (IFF on the SRC).

[0961] 4. Execute the TCP or UDP checksum. If this fails, send the packet to a special error service on the SRC.

[0962] The IXP can also execute further TCP processing, including, but not limited to, the following steps (the combined checks are sketched in code after the list):

[0963] 1. Confirm that header length is correct.

[0964] 2. Confirm that TCP flags are ACK and nothing else.

[0965] 3. Confirm that the only option is TCP timestamp.

[0966] 4. Remember the last window value and confirm that it has not changed.
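Taken together, the checks in the two lists above amount to a packet filter of roughly the following shape; the predicate names are hypothetical stand-ins and the packet is modeled as a dictionary.

    def ip_header_ok(pkt): return True        # hypothetical stand-in
    def ip_checksum_ok(pkt): return True      # hypothetical stand-in
    def l4_checksum_ok(pkt): return True      # hypothetical stand-in

    def ixp_packet_checks(pkt, last_window):
        """Return the disposition of the packet (sketch)."""
        if not ip_header_ok(pkt) or not ip_checksum_ok(pkt):
            return "drop"
        if pkt["fragmented"] or pkt["ip_options"]:
            return "SRC slow path (IFF)"      # certain TCP options remain permitted
        if not l4_checksum_ok(pkt):
            return "SRC error service"
        # optional further TCP checks
        if not pkt["header_len_ok"]:
            return "SRC slow path (IFF)"      # header length incorrect
        if pkt["tcp_flags"] != {"ACK"}:
            return "SRC slow path (IFF)"      # flags other than ACK
        if pkt["tcp_options"] - {"timestamp"}:
            return "SRC slow path (IFF)"      # an option other than timestamp
        if pkt["window"] != last_window:
            return "SRC slow path (IFF)"      # window changed since the last packet
        return "Fast-Path service on SRC"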

[0967] The IXP can also have two special well-known services: TCP_ADD_CHECKSUM and UDP_ADD_CHECKSUM. Packets sent to these services will have TCP and IP, or UDP and IP, checksums added to them. Thus, the illustrated Fast-Path embodiment can utilize a number of well-known services, including two on the IXP:

IPC_SVC_IXP_TCP_CSUM: adds TCP checksum to outbound packets;

IPC_SVC_IXP_UDP_CSUM: adds UDP checksum to outbound packets;

and three on the SRC:

IPC_SVC_SRC_FP: Fast-Path input;

IPC_SVC_SRC_SP: "slow path" input;

IPC_SVC_SRC_FP_ERR: error service that increments error counters.

[0968] 3. Further Fast-Path Aspects

[0969] Referring again to FIGS. 45 and 46, all Fast-Path IPC services (i.e., each service corresponding to a TCP or UDP connection) will have the same IPC callback routine. The flow ID can be readily extracted from the associated Ethernet header information, and can be easily translated into a socket descriptor/socket queue ID by executing a lookup in a Fast-Path session table. Subsequently, both TCB and socket structure pointers can also be quickly obtained by a lookup.

[0970] Fast-Path processing will be somewhat different for TCP and UDP. In the case of UDP, Fast-Path processing of each packet can be simplified substantially to the updating of certain statistics. In the case of TCP, however, a given packet may or may not be eligible for Fast-Path processing, depending on the congestion/flow-control state of the connection. Thus, a Fast-Path session table entry will have a function pointer for either the TCP or UDP Fast-Path protocol handler routine, depending on the socket type. In addition, the TCP handler will determine whether a packet is Fast-Path eligible by examining the associated Fast-Path connection entry, TCP header, TCP control block, and socket structure. If a packet is Fast-Path eligible, the TCP handler will maintain the TCP connection, and transmit control information to the Mediation task's message queue. If the TCP stack's Send process needs to be restarted, the TCP handler will send a message to the TCP stack's task to restart the buffered Send. Conversely, if a packet is not eligible for Fast-Path, the TCP handler will send it to the slow-path IP task.

[0971] In the illustrated embodiments, the Socket Send Call checks to determine whether the socket is Fast-Path enabled, and if it is, calls the Fast-Path Send routine. The Fast-Path Send routine will obtain socket and TCB pointers and will attempt to execute a TCP/IP shortcut and send the packet directly to the IPC. In order to leave a copy of the data in the socket, in case TCP needs to retransmit, the Fast-Path module will duplicate BJ and IBD, increment the REF count on the buffer, and add the IBD to the socket buffer. The illustrated embodiments of Fast-Path do not calculate TCP and IP checksums, but maintain two well-known service numbers, TCP_CHECKSUM_ADD and UDP_CHECKSUM_ADD; the IXP will add checksums on the packets received on these services. The destination IXP will be determined by referencing the source IXP of the last received packet. If the Fast-Path system is unable to transmit the packet directly to the IPC, it will return an error code to the Socket Send Routine, which will then simply continue its normal code path and send the packet to the slow-path TCP task's message queue for further processing.
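The Send path just described, trying the shortcut and falling back to the slow path on error, can be sketched as follows; the helper names and the buffer representation are illustrative assumptions, not the actual socket code.

    def socket_send(sock, data):
        """Sketch of the modified Socket Send Call."""
        if sock.get("fast_path_enabled"):
            if fast_path_send(sock, data) == 0:
                return                          # the shortcut succeeded
        send_to_tcp_task_queue(sock, data)      # normal slow-path code path

    def fast_path_send(sock, data):
        tcb = sock["tcb"]
        buf = {"data": data, "ref": 1}
        buf["ref"] += 1                         # keep a reference for retransmission
        sock.setdefault("send_buffer", []).append(buf)
        pkt = tcb["header_template"] + data     # TCP/IP shortcut: prebuilt headers
        # the IXP adds TCP/IP checksums for packets sent to the checksum service
        return send_to_ipc(pkt, service="TCP_CHECKSUM_ADD")

    def send_to_ipc(pkt, service):              # hypothetical stand-in; 0 == success
        return 0

    def send_to_tcp_task_queue(sock, data):     # hypothetical stand-in
        pass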

[0972] To provide additional streamlining and acceleration of TCP/UDP packet processing, a number of optional simplifications can be made. For example, the described Fast-Path does not itself handle TCP connection establishment and teardown. These tasks are handled by the conventional TCP stack on the SRC. Similarly, the described Fast-Path does not itself handle IP options and IP fragmentation; these conditions are handled by the conventional TCP stacks on both the LRC and the SRC. In the illustrated embodiments, Fast-Path handles the TCP timestamp option, while the conventional TCP stack on the SRC handles all other options. Similarly, the described Fast-Path system does not handle TCP retransmission and reassembly; these aspects are handled by the conventional TCP stack on the SRC. Certain security protocols, such as IPSec, change the IP protocol field and insert their own headers between the IP and TCP headers. The illustrated Fast-Path embodiments can be modified to handle this circumstance.

[0973] Fast-Path can be enabled by each socket's application on a per-socket basis. The system can be set to be disabled by default, and can be enabled by doing a socket ioctl after a socket is created, but before a connection is established. Apart from this, the described Fast-Path is transparent to the socket application, from the viewpoint of the socket interface.

[0974] The performance gains provided by Fast-Path are in part a function of the number of TCP retransmissions in the network. In networks having a large number of packet drops, most of the packets will go through the conventional TCP stack instead of the Fast-Path system. However, in a "good" LAN with limited packet drops, more than 90% of packets will go through Fast-Path, thus providing significant performance improvements.

[0975] For example, the invention can be implemented in the Pirus interconnection system described below and in U.S. provisional patent application No. 60/245,295 (referred to as the "Pirus Box"). The Pirus Box routes, switches and bridges multiple protocols across Fibre Channel, Gigabit Ethernet and SCSI protocols and platforms, thereby enabling interoperability of servers, NAS (network attached storage) devices, IP and Fibre Channel switches on SANs (storage area networks), WANs (wide area networks) or LANs (local area networks). Within the Pirus Box, multiple front-end controllers (IXPs) connect to a high-speed switching fabric and point-to-point serial interconnect. Back-end controllers connect to switched Ethernet or other networks, managing the flow of data from physical storage devices.

[0976] In one implementation of the invention within the Pirus Box, the Fast-Path includes Fast-Path code running on 750-series microprocessors, with hardware acceleration in IXP micro-engines. Alternatively, in a configuration having a close coupling between the IXP modules and the processors terminating TCP sessions, the Fast-Path code is executed together with the hardware acceleration in the IXP micro-engines. In each case, the described Fast-Path code can be highly optimized and placed in gates or micro-engines. Such code will execute much faster than a conventional TCP/IP stack, even when running on the same processor as a conventional stack.

[0977] The Fast-Path methods described herein are not limited to the Pirus Box, but can be implemented in substantially any TCP/UDP processing system.

Glossary of Terms

[0978] Backplane—the Pirus Box chassis is referred to herein as a backplane; however, it will be recognized that the chassis could alternatively be a midplane design.

[0979] CLI—Command Line Interface

[0980] FC—Fibre Channel

[0981] FSC—Fibre Channel Switching Card

[0982] IFF—Layer 2, 3, 4 and 5 Intelligent Filtering and Forwarding switch

[0983] JBOD—Just a Bunch of Disks

[0984] LIC—LAN Interface Card

[0985] MAC—Media Access Control—usually refers to an Ethernet interface chip

[0986] MIC—Management Interface Card

[0987] MTU—Maximum Transfer Unit—largest payload that can be sent on a medium.

[0988] NEC—Network Engine Card

[0989] NP—Network Processor

[0990] SCSI—Small Computer Systems Interface

[0991] SRC—Storage Resource Card

[0992] uP—Microprocessor

[0993] ARP—Address Resolution Protocol

[0994] CLI—Command Line Interface

[0995] CONSOLE—System Console

[0996] CPCM—Card/Processor Configuration Manager

[0997] CSA—Configuration and Statistics Agent

[0998] CSM—Configuration and Statistics Manager

[0999] DC—Disk Cache

[1000] Eth Driver—Ethernet Driver

[1001] FC Nx—Fibre Channel Nx Port

[1002] FFS—Flash File System

[1003] FS—File System

[1004] HTTP—Hyper Text Transfer Protocol

[1005] HTTPS—Hyper Text Transfer Protocol Secure

[1006] IP—Internet Protocol

[1007] IPC—Inter Process Communication

[1008] L2—Layer 2

[1009] LHC—Local Hardware Control

[1010] LOGI—Logging Interface

[1011] MLAN—Management LAN

[1012] MNT—Mount

[1013] NFS—Network File System

[1014] RCB—Rapid Control Backplane

[1015] RPC—Remote Procedure Call

[1016] RSS—Remote Shell Service

[1017] S2—System Services

[1018] SAM—System Abstraction Model

[1019] SB—Service Broker

[1020] SCSI—Small Computer System Interface

[1021] SFI—Switch Fabric Interface

[1022] SGLUE—SNMP Glue

[1023] SNMP—Simple Network Management Protocol

[1024] SSC—Server State Client

[1025] SSH—Secured Shell

[1026] SSM—Server State Manager

[1027] TCP—Transmission Control Protocol

[1028] UDP—User Datagram Protocol

[1029] VM—Volume Manager

[1030] WEBH—WEB Handlers

[1031] Configured Filesystem Server Set: The set of NAS servers that have been configured by the user to serve copies of the filesystem. Also referred to as a NAS peer group.

[1032] Current Filesystem Server Set: The subset of the configured filesystem server set that is made up of members that have synchronized copies of the filesystem.

[1033] Joining Filesystem Server Set: Members not part of the Current Filesystem Server Set that are in the process of joining that set.

[1034] Complete Copy of a Filesystem: A copy of a filesystem containing file data for all file inodes of a filesystem.

[1035] Construction of a Filesystem Copy: Building a sparse or complete copy of a filesystem by copying every element of the source filesystem.

[1036] Filesystem Checkpoint: NCM has ensured that all members of the current filesystem server set have the same copy of the filesystem. A new filesystem checkpoint value was written to all copies and placed on stable storage. The filesystem modification sequence number on all members of the current filesystem server set is the same. The IN-Mod has been cleared on all members of the current filesystem server set.

[1037] Filesystem Checkpoint Value: Filesystems and NVRAM are marked with a filesystem checkpoint value to indicate when running copies of the filesystem were last checkpointed. This is used to identify stale (non-identical, non-synchronized) filesystems.

[1038] Filesystem Modification Sequence Number: The number of NFS modification requests performed by a NAS server since the last filesystem checkpoint. Each NAS server is responsible for maintaining its own stable storage copy that is accessible to the NCM after a failure. The filesystem checkpoint value combined with this number indicates which NAS server has the most recent copy of the filesystem.

[1039] Inode List Allocated (IN-Alloc): The list of inodes in a filesystem that have been allocated.

[1040] Inode List Content (IN-Con): The list of inodes in a filesystem that have content present on a server; this must be a subset of IN-Alloc. This will include every non-file (i.e., directory) inode. If this is a Complete Copy of a Filesystem, then IN-Con is identical to IN-Alloc.

[1041] Inode List Copy (IN-Copy): Which inodes of a filesystem have been modified since we began copying the filesystem (during Construction/Restoration); in the disclosed embodiments, this must be a subset of IN-Con.

[1042] Inode List Modified (IN-Mod): Which inodes have been modified since the last filesystem checkpoint. Two filesystems with the same filesystem checkpoint value should only differ by the changes represented by their modified inode lists. A Filesystem Checkpoint between two filesystems means that each is a logical image of the other, and the IN-Mod can be cleared.

[1043] NCM—NAS Coherency Manager: The Pirus chassis process that is responsible for synchronizing peer NAS servers.

[1044] Peer NAS Server: Any CPU that is a member of a virtual storage target (VST) group.

[1045] Recovery of a Filesystem Copy: Bringing an out-of-date filesystem copy into sync with a later copy. This can be accomplished by construction or restoration.

[1046] Restoration of a Filesystem Copy: Bringing a previously served filesystem from its current state to the state of an up-to-date copy by a means other than an element-by-element copy of the original.

[1047] Sparse Copy of a Filesystem: A copy of a filesystem containing file data for less than all file inodes of a filesystem.

[1048] VST—Virtual Storage Target: As used herein, this term refers to a group of NAS server CPUs within a Pirus chassis that creates the illusion of a single NAS server to an external client.

[1049] ARM, StrongARM processors: general-purpose processors with embedded networking protocols and/or applications compliant with those of ARM Holdings, PLC (formerly Advanced RISC Machines) of Cambridge, U.K.

[1050] BSD: sometimes referred to as Berkeley UNIX, an open source operating system developed in the 1970s at U.C. Berkeley. BSD is found in nearly every variant of UNIX, and is widely used for Internet services and firewalls, timesharing, and multiprocessing systems.

[1051] IFF—Intelligent Forwarding and Filtering (described elsewhere in this document in the context of the Pirus Box architecture).

[1052] IOCTL: A system-dependent device control system call; the ioctl function typically performs a variety of device-specific control functions on device special files.

[1053] IPC: Inter-Process Communications. On the Internet, IPC is implemented using the TCP transport-layer protocol.

[1054] IPSec: IP security protocol, a standard used for interoperable network encryption.

[1055] IXP: Internet Exchange Processors, such as Intel's IXP 1200 Network Processors, can be used at various points in a network or switching system to provide routing and other switching functions. Intel's IXP 1200, for example, is an integrated network processor based on the StrongARM architecture and six packet-processing micro-engines. It supports software and hardware compliant with the Intel Internet Exchange Architecture (IXA). See the Pirus Box architecture described elsewhere in this document.

[1056] LRC: LAN Resource Card. In the Pirus Box described herein, the LRC interfaces to external LANs, servers or WANs, performs load balancing and content-aware switching, implements storage mediation protocols and provides TCP hardware acceleration in accordance with the present invention.

[1057] MAC address: Media Access Control address; a hardware address that uniquely identifies each node of a network.

[1058] Micro-engine: Micro-coded processor in the IXP. In one implementation of the Pirus Box, there are six in each IXP.

[1059] NFS: Network File System

[1060] Protocol Mediation: applications and/or devices that translate between and among different protocols, such as TCP/IP, X.25, SNMP and the like. Particular Mediation techniques and systems are described elsewhere in this document in connection with the Pirus Box.

[1061] RDMA: Remote Direct Memory Access. The transfer of application data from a remote buffer into a contiguous local buffer. Typically refers to memory-to-memory copying between processors, over protocols running on TCP such as HTTP and NFS, across an Ethernet.

[1062] SCSI: Small Computer System Interface, a widely-used ANSI standards-based family of protocols for communicating with I/O devices, particularly storage devices.

[1063] iSCSI: Internet SCSI, a proposed transport protocol for SCSI that operates on top of TCP, and transmits native SCSI over a layer of the IP stack. The Pirus Box described herein provides protocol mediation services to iSCSI devices and networks ("iSCSI Mediation Services"), using TCP/IP to provide LAN-attached servers with access to block-oriented storage.

[1064] Silly Window Avoidance Algorithm (Send-Side): A technique in which the sender delays sending segments until it can accumulate a reasonable amount of data in its output buffer. In some cases, a "reasonable amount" is defined to be a maximum-sized segment (MSS).

[1065] SRC: Storage Resource Card. In the Pirus Box architecture described herein, the SRC interfaces to external storage devices, provides NFS and CIFS services, implements IP to Fibre Channel (FC) storage mediation, provides volume management services (including dynamic storage partitioning and JBOD (Just a Bunch of Disks) aggregation to create large storage pools), supports RAID functionality and provides integrated Fibre Channel SAN switching.

[1066] TCP: Transmission Control Protocol, a protocol central to TCP/IP networks. TCP guarantees delivery of data and that packets will be delivered in the same order in which they were sent.

[1067] TCP/IP: Transmission Control Protocol/Internet Protocol, the suite of communications protocols used to connect hosts on the Internet.

[1068] UDP: User Datagram Protocol (UDP) supports a datagram mode of packet-switched communications in an interconnected set of computer networks, and enables applications to message other programs with a minimum of protocol mechanism. UDP is considerably simpler than TCP and is useful in situations where the reliability mechanisms of TCP are not necessary. The UDP header has only four fields: source port, destination port, length, and UDP checksum.

[1069] VxWorks: a real-time operating system, part of the Tornado II embedded development platform commercially available from WindRiver Systems, Inc. of Alameda, Calif., which is designed to enable developers to create complex real-time applications for embedded microprocessors.

TABLE OF CONTENTS

Incorporation by Reference/Priority Claim
Field of the Invention
Background of the Invention
Summary of the Invention
Brief Description of the Drawings
Detailed Description of the Invention
I. Overview
II. Hardware/Software Architecture
  1. Software Architecture Overview
    1.1. System Services
      1.1.1. SanStreaM (SSM) System Services (S2)
      1.1.2. SSM Application Service (AS)
  2. Management Interface Card
    2.1. Management Software
    2.2. Management Software Overview
      2.2.1. User Interfaces (UIs)
      2.2.2. Rapid Control Backplane (RCB)
      2.2.3. System Abstraction Model (SAM)
      2.2.4. Configuration & Statistics Manager (CSM)
      2.2.5. Logging / Billing (APIs)
      2.2.6. Configuration & Statistics Agent (CSA)
    2.3. Dynamic Configuration
    2.4. Management Applications
      2.4.1. Volume Manager
      2.4.2. Load Balancer
      2.4.3. Server-less Backup (NDMP)
      2.4.4. IP-ized Storage Management
      2.4.5. Mediation Manager
      2.4.6. VLAN Manager
      2.4.7. File System Manager
    2.5. Virtual Storage Domain (VSD)
      2.5.1. Services
      2.5.2. Policies
    2.6. Boot Sequence and Configuration
  3. LIC Software
    3.1. VLANs
      3.1.1. Intelligent Filtering and Forwarding (IFF)
    3.2. Load Balance Data Flow
    3.3. LIC - NAS Software
      3.3.1. Virtual Storage Domains (VSD)
      3.3.2. Network Address Translation (NAT)
      3.3.3. Local Load Balance (LLB)
        3.3.3.1. Load Balancing Order of Operations
        3.3.3.2. File System Server Load Balance (FSLB)
        3.3.3.3. NFS Server Load Balancing (NLB)
        3.3.3.4. TCP and UDP - Methods of Balancing
        3.3.3.5. Write Replication
      3.3.4. Load Balancer Failure Indication
        3.3.4.1. CIFS Server Load Balancing
        3.3.4.2. Content Load Balance
    3.4. LIC - SCSI/IP Software
    3.5. Network Processor Functionality
      3.5.1. Flow Control
        3.5.1.1. Flow Definition
        3.5.1.2. Flow Control Model
      3.5.2. Flow Thru v. Buffering
        3.5.2.1. Flow Thru
        3.5.2.2. Buffering
  4. SRC NAS (Software Features)
    4.1. SRC NAS Storage Features
      4.1.1. Volume Manager
      4.1.2. Disk Cache
      4.1.3. SCSI
      4.1.4. Fibre Channel
      4.1.5. Switch Fabric Interface
    4.2. NAS Pirus System Features
      4.2.1. Configuration/Statistics
      4.2.2. NFS Load Balancing
      4.2.3. NFS Mirroring Service
  5. SRC Mediation
    5.1. Supported Mediation Protocols
      5.1.1. SCSI/UDP
    5.2. Storage Components
      5.2.1. SCSI/IP Layer
      5.2.2. SCSI Mediator
      5.2.3. Volume Manager
      5.2.4. SCSI Originator
      5.2.5. SCSI Target
      5.2.6. Fibre Channel
    5.3. Mediation Example
III. NFS Load Balancing
  1. Operation
    1.1. Read Requests
    1.2. Determining the Number of Servers for a File
    1.3. Server Lists
      1.3.1. Single Server List
      1.3.2. Multiple Server Lists
    1.4. Synchronizing Lists Across Multiple IXPs
IV. Intelligent Forwarding and Filtering
  1. Definitions
  2. Virtual Domains
    2.1. Network Address Translation
  3. VLAN Definition
    3.1. Default VLAN
    3.2. Server Administration VLAN
    3.3. Server Access VLAN
    3.4. Port Types
      3.4.1. Router Port
      3.4.2. Server Port
      3.4.3. Combo Port
      3.4.4. Server Administration Port
      3.4.5. Server Access Port
      3.4.6. Example of VLAN
  4. Filtering Function
  5. Forwarding Function
    5.1. Flow Entry Description
      5.1.1. Source IP Address
      5.1.2. Destination IP Address
      5.1.3. Destination TCP/UDP Port
      5.1.4. Source Physical Port
      5.1.5. Source Next-Hop MAC Address
      5.1.6. Destination Physical Port
      5.1.7. Destination Next-Hop MAC Address
      5.1.8. NAT IP Address
      5.1.9. NAT TCP/UDP Port
      5.1.10. Flags
      5.1.11. Received Packets
      5.1.12. Transmitted Packets
      5.1.13. Received Bytes
      5.1.14. Transmitted Bytes
      5.1.15. Next Pointer (Receive Path)
      5.1.16. Next Pointer (Transmit Path)
    5.2. Adding Forwarding Entries
      5.2.1. Client IP Addresses
      5.2.2. Virtual Domain IP Addresses
      5.2.3. Server IP Addresses
    5.3. Distribute the Forwarding Table
    5.4. Ingress Function
  6. Egress Function
V. IP-Based Storage Management - Device Discovery & Monitoring
  Examples: Server Health; Mediation Target; Mediation Initiator; NCM
VI. Data Structure Layout
  1. VSD_CFG_T
  2. VSE_CFG_T
  3. SERVER_CFG_T
  4. MED_TARG_CFG_T
  5. LUN_MAP_CFG_T
  6. FILESYS_CFG_T
VII. NAS Mirroring and Content Distribution
  1. Content Distribution and Mirroring
  2. Mirror Initialization via NAS; Mirror Initialization via NDMP; Sparse Content Distribution; NCM; NCM Objectives; NCM Architecture; NCM Processes and Locations; NCM and IPC Services; NCM and Inode Management; Inode Allocation Synchronization; Inode Inconsistency Identification
  3. Filesystem Server Sets
    3.1. Types
    3.2. State of the Current Server Set
  4. Description of Operations: Create_Current_Filesystem_Server_Set (fsid, slots/cpus); Add_Member_To_Current_Filesystem_Server_Set (fsid, slot/cpu)
  5. Description of Operations that Change the State of the Current Server Set: Activate_Server_Set (fsid); Pause Filesystem Server Set (fsid); Continue Filesystem Server Set (fsid); Deactivate_Server_Set (fsid)
  6. Recovery Operations on a Filesystem Copy: Construction; Restoration; NAS-FS-Copy; Construction of Complete Copy; Copy Method; Special Inodes; Locking; Restoration of Complete Copy; Data Structures; Modified-Inodes-List (IN-Mod); Copy-Inodes-List (IN-Copy); Copy Progress; Copying Inodes; Construction Case; Restoration Case; Examples - Set 1; Examples - Set 2
VIII. System Mediation Manager: Components; Storage Hierarchy; Functional Specification; Functional Requirement; Design; Data Structures; Flow Chart
IX. Mediation Caching
X. Server Health Monitoring: Operation with Network Attached Storage (NAS); Operation with iSCSI/Mediation Devices; Forwarding of NFS Traffic; NFS Read Load Balancing Algorithms
XI. Fast-Path: Description of Illustrated Embodiments: Fast-Path Architecture; Fast-Path Functions; LRC Processing; SRC Processing; Session Creation/Termination; Session Control Blocks; IXP Services; Further Fast-Path Aspects
Abstract

We claim:
 1. In a digital network including at least first and second Client Servers, each of the first and second Client Servers being operable to communicate with (1) respective local clients and (2) a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the digital network being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a method of accelerating read access to data by clients, the method comprising: providing, for each of the first and second Client Servers, a respective local read cache operable to communicate with the Client Server, operable to store a copy of recently read data; providing, for each of the first and second Client Servers, a respective local write cache operable to communicate with the Client Server, operable to store a copy of data to be written; receiving a read access request from a client in communication with the first or second Client Server; in response to receipt of the read access request, checking the local write cache for a data segment match; if no data segment match is found in the local write cache, checking the local read cache for a data segment match; if the segment is found in the local cache, transmitting to the remote Data Server a request to determine the validity of the data in the local read cache, thereby to determine whether the data in the local read cache must be updated from the remote Data Server; if the data in the local read cache is not valid, or if no data segment match is found in the local read cache, transmitting the read access request to the remote Data Server for serving of the requested data; and once the requested data is transmitted from the remote Data Server, storing a copy of the requested data in the local read cache.
 2. The method of claim 1, further comprising: assigning a time-stamp to a data segment stored in the local read cache; assigning a time-stamp to a data segment stored in the remote Data Server; upon receipt of a request to determine the validity of the data in the local read cache, comparing the time-stamp of the data segment stored in the local read cache with the time-stamp of a comparable data segment stored in the remote Data Server to determine whether the data segment in the local read cache is older than the comparable data segment on the remote Data Server; and if the data segment in the local read cache is older than the comparable data segment on the remote Data Server, designating as invalid the data segment in the local read cache.
 3. The method of claim 2, further comprising: if the data segment in the local read cache is designated invalid, then transmitting the read access request to the remote Data Server for serving of the requested data; and once the requested data is transmitted from the remote Data Server, storing a copy of the requested data in the local read cache and updating the respective time-stamps.
Multiple Client Servers—Writes
 4. In a digital network including at least first and second Client Servers, each of the first and second Client Servers being operable to communicate with (1) respective local clients and (2) a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the digital network being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a method of accelerating response to write access requests by clients, the method comprising: providing, for each of the first and second Client Servers, a respective local read cache operable to communicate with the Client Server, operable to store a copy of recently read data; providing, for each of the first and second Client Servers, a respective local write cache operable to communicate with the Client Server, operable to store a copy of data to be written; receiving a write request from a client in communication with the first or second Client Server; in response to receipt of the write request, checking the respective local read cache for a data segment match, and if a matching data segment is detected, invalidating the matching data segment; checking the respective local write cache for a data segment match, and if a matching write segment is detected, invalidating or reusing the matching write segment; transmitting to the remote Data Server a request to determine whether the write segment is available for writing, and if the segment is unavailable, waiting for the write segment to become available; generating a new write cache entry representing the write data segments to be written; and transmitting to the remote Data Server a request to unlock the data segments to be written.
 5. The method of claim 4 further comprising: determining whether data segments are available for writing by the first Client Server by checking whether the data segments are being written or otherwise are locked by the second Client Server during the time the first Client Server requests write access to the same or overlapping data segments.
 6. The method of claim 5 wherein the determining further comprises generating a lock request on the requested data segment if the segment is available.
 7. In a switching system adapted to interconnect local clients in communication with a Client Server, the Client Server being operable to communicate with a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the switching system being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a method of accelerating read access to data by clients, the method comprising: providing a local data cache operable to communicate with the Client Server; storing within the local data cache a copy of data recently read from the remote Data Server; determining, when a client in communication with the Client Server requests data by means of a read request, whether the requested data is present in the local data cache; and if the requested data is present in the local data cache, providing the client access to the cached data from the local data cache; or, if the requested data is not present in the local data cache, transmitting the read request to the remote Data Server for serving of the requested data by the remote Data Server, and, once the requested data is transmitted to the Client Server, storing a copy of the requested data in the local data cache.
 8. In a switching system adapted to interconnect local clients in communication with a Client Server, the Client Server being operable to communicate with a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the switching system being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a method of accelerating response to write access requests by clients, the method comprising: providing a local data cache operable to communicate with the Client Server; when data to be written is received from a client, storing the data in the local data cache and transmitting to the client a write operation completion signal upon storage of the data in the local data cache; and subsequently transmitting the data to be written to the remote Data Server for writing out to the storage devices connected thereto.
 9. The method of claim 8 further comprising: accumulating in the local data cache multiple segments of data to be written; and subsequently transmitting the multiple segments of data to be written in a batch operation to the remote Data Server.
 10. The method of claim 9 wherein the multiple segments of data to be written are transmitted in a semi-contiguous write access to the remote Data Server.
 11. In a switching system adapted to interconnect local clients in communication with a Client Server, the Client Server being operable to communicate with a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the switching system being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a method of accelerating read access to data by clients, the method comprising: providing a local read cache operable to communicate with the Client Server, operable to store a copy of recently read data; providing a local write cache operable to communicate with the Client Server, operable to store a copy of data to be written; storing within the local read cache a copy of data recently read from the remote Data Server; receiving a read access request from a client in communication with the Client Server; in response to receipt of the read access request, checking the local write cache for a data segment match; if no data segment match is found in the local write cache, checking the local read cache for a data segment match; if no data segment match is found in the local read cache, transmitting the read access request to the remote Data Server for serving of the requested data; and once the requested data is transmitted from the remote Data Server, storing a copy of the requested data in the local read cache.
 12. In a switching system adapted to interconnect local clients in communication with a Client Server, the Client Server being operable to communicate with a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the switching system being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a method of accelerating response to write access requests by clients, the method comprising: providing a local read cache operable to communicate with the Client Server, operable to store a copy of recently read data; providing a local write cache operable to communicate with the Client Server, operable to store a copy of data to be written; receiving a write request from a client in communication with the Client Server; in response to receipt of the write request, checking the local read cache for a matching read segment and, if a matching read segment is detected, invalidating the matching read segment; checking the local write cache for matching write segments and, if a matching write segment is detected, invalidating or reusing the write segment; and generating a new local write cache entry representing the write data segment.
 13. In a switching system including at least first and second Client Servers, each of the first and second Client Servers being operable to communicate with (1) respective local clients and (2) a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the switching system being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a method of accelerating read access to data by clients, the method comprising: providing, for each of the first and second Client Servers, a respective local read cache operable to communicate with the Client Server, operable to store a copy of recently read data; providing, for each of the first and second Client Servers, a respective local write cache operable to communicate with the Client Server, operable to store a copy of data to be written; receiving a read access request from a client in communication with the first or second Client Server; in response to receipt of the read access request, checking the local write cache for a data segment match; if no data segment match is found in the local write cache, checking the local read cache for a data segment match; if the segment is found in the local cache, transmitting to the remote Data Server a request to determine the validity of the data in the local read cache, thereby to determine whether the data in the local read cache must be updated from the remote Data Server; if the data in the local read cache is not valid, or if no data segment match is found in the local read cache, transmitting the read access request to the remote Data Server for serving of the requested data; and once the requested data is transmitted from the remote Data Server, storing a copy of the requested data in the local read cache.
 14. The method of claim 13, further comprising: assigning a time-stamp to a data segment stored in the local read cache; assigning a time-stamp to a data segment stored in the remote Data Server; upon receipt of a request to determine the validity of the data in the local read cache, comparing the time-stamp of the data segment stored in the local read cache with the time-stamp of a comparable data segment stored in the remote Data Server to determine whether the data segment in the local read cache is older than the comparable data segment on the remote Data Server; and if the data segment in the local read cache is older than the comparable data segment on the remote Data Server, designating as invalid the data segment in the local read cache.
 15. The method of claim 14, further comprising: if the data segment in the local read cache is designated invalid, then transmitting the read access request to the remote Data Server for serving of the requested data; and once the requested data is transmitted from the remote Data Server, storing a copy of the requested data in the local read cache and updating the respective time-stamps.
 16. In a switching system connectable to at least first and second Client Servers, each of the first and second Client Servers being operable to communicate with (1) respective local clients and (2) a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the switching system being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a method of accelerating response to write access requests by clients, the method comprising: providing, for each of the first and second Client Servers, a respective local read cache operable to communicate with the Client Server, operable to store a copy of recently read data; providing, for each of the first and second Client Servers, a respective local write cache operable to communicate with the Client Server, operable to store a copy of data to be written; receiving a write request from a client in communication with the first or second Client Server; in response to receipt of the write request, checking the respective local read cache for a data segment match, and if a matching data segment is detected, invalidating the matching data segment; checking the respective local write cache for a data segment match, and if a matching write segment is detected, invalidating or reusing the matching write segment; transmitting to the remote Data Server a request to determine whether the write segment is available for writing, and if the segment is unavailable, waiting for the write segment to become available; generating a new write cache entry representing the write data segments to be written; and transmitting to the remote Data Server a request to unlock the data segments to be written.
 17. The method of claim 16 further comprising: determining whether data segments are available for writing by the first Client Server by checking whether the data segments are being written or otherwise are locked by the second Client Server during the time the first Client Server requests write access to the same or overlapping data segments.
 18. The method of claim 17 wherein the determining further comprises generating a lock request on the requested data segment if the segment is available.
 19. In a digital network having at least first and second Client Servers, each of the first and second Client Servers being operable to communicate with (1) respective local clients and (2) a remote Data Server to request access to data files on storage devices connected to the remote Data Server, the network being operable to provide mediation between storage and networking protocols used for communication between clients, servers and storage devices, a system for accelerating read access to data by clients, the system comprising: means for providing, for each of the first and second Client Servers, a respective local read cache operable to communicate with the Client Server, operable to store a copy of recently read data; means for providing, for each of the first and second Client Servers, a respective local write cache operable to communicate with the Client Server, operable to store a copy of data to be written; means for receiving a read access request from a client in communication with the first or second Client Server; means for, in response to receipt of the read access request, checking the local write cache for a data segment match; if no data segment match is found in the local write cache, checking the local read cache for a data segment match; if the segment is found in the local cache, transmitting to the remote Data Server a request to determine the validity of the data in the local read cache, thereby to determine whether the data in the local read cache must be updated from the remote Data Server; if the data in the local read cache is not valid, or if no data segment match is found in the local read cache, transmitting the read access request to the remote Data Server for serving of the requested data; and means for, once the requested data is transmitted from the remote Data Server, storing a copy of the requested data in the local read cache.